{"title": "Learning How to Teach or Selecting Minimal Surface Data", "book": "Advances in Neural Information Processing Systems", "page_first": 364, "page_last": 371, "abstract": null, "full_text": "Learning How To Teach\nor\nSelecting Minimal Surface Data\n\nDavi Geiger\nSiemens Corporate Research, Inc\n755 College Rd. East\nPrinceton, NJ 08540\nUSA\n\nRicardo A. Marques Pereira\nDipartimento di Informatica\nUniversita di Trento\nVia Inama 7, Trento, TN 38100\nITALY\n\nAbstract\n\nLearning a map from an input set to an output set is similar to the problem of reconstructing hypersurfaces from sparse data (Poggio and Girosi, 1990). In this framework, we discuss the problem of automatically selecting \"minimal\" surface data. The objective is to be able to approximately reconstruct the surface from the selected sparse data. We show that this problem is equivalent to the one of compressing information by data removal and the one of learning how to teach. Our key step is to introduce a process that statistically selects the data according to the model. During the process of data selection (learning how to teach) our system (teacher) is capable of predicting the new surface, the approximated one provided by the selected data. We concentrate on piecewise smooth surfaces, e.g. images, and use mean field techniques to obtain a deterministic network that is shown to compress image data.\n\n1  Learning and surface reconstruction\n\nGiven dense input data that represent a hypersurface, how could we automatically select very few data points such that these fewer data points (sparse data) can be used to approximately reconstruct the hypersurface? 
\nWe will be using the term surface to refer to hypersurface (surface in multidimensions) throughout the paper.\n\nIt has been shown (Poggio and Girosi, 1990) that the problem of reconstructing a surface from sparse and noisy data is equivalent to the problem of learning from examples. For instance, learning how to add numbers can be cast as finding the map from $X$ = {pair of numbers} to $F$ = {sum} from a set of noisy examples. The surface is $F(X)$ and the sparse and noisy data are the set of $N$ examples $\\{(x_i, d_i)\\}$, where $i = 0, 1, \\ldots, N$ and $x_i = (a_i, b_i) \\in X$, such that $a_i + b_i = d_i + \\eta_i$ ($\\eta_i$ being the noise term). Some a priori information about the surface, e.g. smoothness, is necessary for reconstruction.\n\nConsider a set of $N$ input-output examples, $\\{(x_i, d_i)\\}$, and a form $\\| P f \\|^2$ for the cost of the deviation of $f$, the approximated surface, from smoothness. $P$ is a differential operator and $\\| \\cdot \\|$ is a norm (usually $L_2$). Finding the surface $f$ that best fits (i) the data and (ii) the smoothness criteria amounts to minimizing the functional\n\n$$V(f) = \\sum_{i=0}^{N-1} (d_i - f(x_i))^2 + \\mu \\| P f \\|^2$$\n\nDifferent methods of minimizing the functional can yield different types of network. In particular, using Green's method gives supervised backprop type of networks (Poggio and Girosi, 1990), and using optimization techniques (like gradient descent) we obtain unsupervised (with feedback) type of networks.\n\n2  Learning how to teach arithmetic operations\n\nThe problem of learning how to add and multiply is a simple one and yet provides insights into our approach of selecting the minimum set of examples. 
\n\nLearning arithmetic operations. The surface given by the addition of two numbers, namely $f(x, y) = x + y$, is a plane passing through the origin. The multiplication surface, $f(x, y) = xy$, is hyperbolic. The a priori knowledge of the addition and multiplication surfaces can be expressed as a minimum of the functional\n\n$$V(f) = \\int_{-\\infty}^{\\infty} \\int_{-\\infty}^{\\infty} \\| \\nabla^2 f(x, y) \\| \\, dx \\, dy$$\n\nwhere\n\n$$\\nabla^2 f(x, y) = \\left( \\frac{\\partial^2}{\\partial x^2} + \\frac{\\partial^2}{\\partial y^2} \\right) f(x, y)$$\n\nOther functions also minimize $V(f)$, like $f(x, y) = x^2 - y^2$, and so a few examples are necessary to learn how to add and multiply given the above prior knowledge. If the prior assumption considers a larger class of basis functions, then more examples will be required. Given $p$ input-output examples, $\\{(x_i, y_i); d_i\\}$, the learning problem of adding and multiplying can be cast as the optimization of\n\n$$V(f) = \\sum_{i=0}^{p-1} (f(x_i, y_i) - d_i)^2 + \\mu \\int_{-\\infty}^{\\infty} \\int_{-\\infty}^{\\infty} \\| \\nabla^2 f(x, y) \\| \\, dx \\, dy$$\n\nWe now consider the problem of selecting the examples from the full surface data.\n\nA sparse process for selecting data. Let us assume that the full set of data is given on a 2-dimensional lattice. So we have a finite amount of data ($N^2$ data points), with the input-output set being $\\{(x_i, y_j); d_{ij}\\}$, where $i, j = 0, 1, \\ldots, N-1$. To select $p$ examples we introduce a sparse process that selects out data by modifying the cost function according to\n\n$$V = \\sum_{i,j=0}^{N-1} (1 - s_{ij}) (f(x_i, y_j) - d_{ij})^2 + \\mu \\int_{-\\infty}^{\\infty} \\int_{-\\infty}^{\\infty} \\| \\nabla^2 f(x, y) \\| \\, dx \\, dy + \\lambda \\left( p - \\sum_{i,j=0}^{N-1} (1 - s_{ij}) \\right)^2$$\n\nwhere $s_{ij} = 1$ selects out the data and we have added the last term to ensure that $p$ examples are selected. The data term forces noisy data to be thrown out first, and the second order smoothness of $f$ reduces the need for many examples ($p \\sim 10$) to learn these arithmetic operations. 
Learning $s$ is equivalent to learning how to select the examples, or to learning how to teach. The system (teacher) has to learn a set of examples (sparse data) that contains all the \"relevant\" information. The redundant information can be \"filled in\" by the prior knowledge. Once the teacher has learned these selected examples, he, she or it (machine) presents them to the student, who with the a priori knowledge about surfaces is able to approximately learn the full input-output map (surface).\n\n3  Teaching piecewise smooth surfaces\n\nWe first briefly introduce the weak membrane model, a coupled Markov random field for modeling piecewise smooth surfaces. Then we lay down the framework for learning to teach this surface.\n\n3.1  Weak membrane model\n\nWithin the Bayes approach the a priori knowledge that surfaces are smooth (first order smoothness) but not at the discontinuities has been analyzed by (Geman and Geman, 1984), (Blake and Zisserman, 1987), (Mumford and Shah, 1985) and (Geiger and Girosi, 1991). If we consider the noise to be white Gaussian, the final posterior probability becomes $P(f, l | g) = \\frac{1}{Z} e^{-\\beta V(f, l)}$, where\n\n$$V(f, l) = \\sum_{i,j} \\left[ (f_{ij} - g_{ij})^2 + \\mu \\| \\nabla f \\|_{ij}^2 (1 - l_{ij}) + \\gamma_{ij} l_{ij} \\right] \\qquad (1)$$\n\nWe represent surfaces by $f_{ij}$ at pixel $(i, j)$, and discontinuities by $l_{ij}$. The input data is $g_{ij}$, and $\\| \\nabla f \\|_{ij}$ is the norm of the gradient at pixel $(i, j)$. $Z$ is a normalization constant, known as the partition function. $\\beta$ is a global parameter of the model, inspired by thermodynamics, and $\\mu$ and $\\gamma_{ij}$ are parameters to be estimated. This model, when used for image segmentation, has been shown to give a good pattern of discontinuities and eliminate the noise, thus suggesting that the piecewise smoothness assumption is valid for images. 
\n\n3.2  Redundant data\n\nWe have assumed the surface to be smooth and therefore there is redundant information within smooth regions. We then propose a model that selects the \"relevant\" information according to two criteria.\n\n1.  Discontinuity data: Discontinuities usually capture relevant information, and it is possible to roughly approximate surfaces just using edge data (see Geiger and Pereira, 1990). A limitation of just using edge data is that an oversmoothed surface is represented.\n\n2.  Texture data: Data points that have significant gradients (not enough to be a discontinuity) are here considered texture data. Keeping texture data allows us to distinguish between flat surfaces, as for example a clean sky in an image, and textured surfaces, as for example the leaves in the tree (see figure 2).\n\n3.3  The sparse process\n\nAgain, our proposal is first to extend the weak membrane model by including an additional binary field, the sparse process $s$, that is 1 when data is selected out and 0 otherwise. There are natural connections between the process $s$ and robust statistics (Huber, 1988) as discussed in (Geiger and Yuille, 1990) and (Geiger and Pereira, 1991). We modify (1) by considering (see also Geiger and Pereira, 1990)\n\n$$V(f, l, s) = \\sum_{i,j} \\left[ (1 - s_{ij}) (f_{ij} - g_{ij})^2 + \\mu \\| \\nabla f \\|_{ij}^2 (1 - l_{ij}) + \\eta_{ij} s_{ij} + \\gamma_{ij} l_{ij} \\right] \\qquad (2)$$\n\nwhere we have introduced the term $\\eta_{ij} s_{ij}$ to keep some data; otherwise $s_{ij} = 1$ everywhere. If the data term is too large, the process $s = 1$ can suppress it. We will now assume that the data is noise-free, or that the noise has already been smoothed out. We then want to find which data points ($s = 0$) are necessary to keep in order to reconstruct $f$. 
\n\n3.4  Mean field equations and unsupervised networks\n\nTo impose the discontinuity data constraint we use the hard constraint technique (Geiger and Yuille, 1990 and its references). We do not allow states that throw out data ($s_{ij} = 1$) at the edge location ($l_{ij} = 1$). More precisely, within the statistical framework we reduce the possible states for the processes $s$ and $l$ to $s_{ij} l_{ij} = 0$, thereby excluding the state ($s_{ij} = 1$, $l_{ij} = 1$). Applying the saddle point approximation, a well known mean field technique (Geiger and Girosi, 1989 and its references), on the field $f$, we can compute the partition function\n\n$$Z = \\sum_{f = (0, \\ldots, 255)^{N^2}} \\; \\sum_{s, l = (0, 1)^{N^2}, \\; s \\cdot l = 0} e^{-\\beta V(f, l, s)} \\approx \\sum_{s, l = (0, 1)^{N^2}, \\; s \\cdot l = 0} e^{-\\beta V(f, l, s)} \\approx \\prod_{ij} Z_{ij}$$\n\n$$Z_{ij} = e^{-\\beta [\\gamma_{ij} + (f_{ij} - g_{ij})^2]} + e^{-\\beta [\\mu \\| \\nabla f \\|_{ij}^2 + \\eta_{ij}]} + e^{-\\beta [\\mu \\| \\nabla f \\|_{ij}^2 + (f_{ij} - g_{ij})^2]} \\qquad (3)$$\n\nwhere $f$ maximizes $Z$. After applying mean field techniques we obtain the following equations for the processes $l$ and $s$\n\n(4)\n\nand, using the definition $\\| \\nabla f \\|_{ij}^2 = (f_{i,j+1} - f_{i+1,j})^2 + (f_{i+1,j+1} - f_{i,j})^2$, the mean field self-consistent equation (Geiger and Pereira, 1991) becomes\n\n$$\\ldots - \\mu \\{ K_{ij} (1 - l_{ij}) + K_{i-1,j-1} (1 - l_{i-1,j-1}) + M_{i-1,j} (1 - l_{i-1,j}) + M_{i,j-1} (1 - l_{i,j-1}) \\} \\qquad (5)$$\n\nwhere $K_{ij} = (f_{i+1,j+1} - f_{i,j})^2$ and $M_{ij} = (f_{i+1,j} - f_{i,j+1})^2$. The set of coupled equations (4) and (5) can be mapped to an unsupervised network, which we call a minimal surface representation network (MSRN), and can be solved efficiently on a massively parallel machine. Notice that $s_{ij} + l_{ij} \\leq 1$, because of the hard constraint, and in the limit of $\\beta \\to \\infty$ the processes $s$ and $l$ become either 0 or 1. 
In order to throw away redundant (smooth) data while keeping some of the texture, we adapt the cost $\\eta_{ij}$ according to the gradient of the surface. More precisely, we set\n\n(6)\n\nwhere $(\\Delta^h_{ij} g)^2 = (g_{i+1,j} - g_{i-1,j})^2$ and $(\\Delta^v_{ij} g)^2 = (g_{i,j+1} - g_{i,j-1})^2$. The smoother the data, the lower the cost to discard the data ($s_{ij} = 1$). In the limit of $\\eta \\to 0$ only edge data ($l_{ij} = 1$) is kept, since from (4) $\\lim_{\\eta \\to 0} s_{ij} = 1 - l_{ij}$.\n\n3.5  Learning how to teach and the approximated surface\n\nWith the mean field equations we compute the approximated surface $f$ simultaneously with $s$ and $l$. Thus, while learning the process $s$ (the selected data) the system also predicts the approximated surface $f$ that the student will learn from the selected examples. By changing the parameters, say $\\mu$ and $\\eta$, the teacher can choose the optimal parameters so as to select less data and preserve the quality of the approximated surface. Once $s$ has been learned the system only feeds the selected data points to the learner machinery. We actually relax the condition and feed the learner with the selected data and the corresponding discontinuity map ($l$). Notice that in the limit of $\\eta \\to 0$ the selected data points are coincident with the discontinuities ($l = 1$).\n\n4  Results: Image compression\n\nWe show the results of the algorithm to learn the minimal representation of images. The algorithm is capable of image compression, and one advantage over the cosine transform (traditional method) is that it does not have the problem of breaking the images into blocks. However, a more careful comparison is needed. 
\n\n4.1  Learning s, f, and l\n\nTo analyze the quality of the surface approximation, we show in figure 1 the performance of the network as we vary the threshold $\\eta$. We first show a face image and the line process and then the predicted approximated surfaces together with the corresponding sparse process $s$.\n\n4.2  Reconstruction, Generalization or \"The student performance\"\n\nWe can now test how the student learns from the selected examples, or how good the surface reconstruction from the selected data is. We reconstruct the approximate surfaces by running (5) again, but with the selected surface data points ($s_{ij} = 0$) and the discontinuities ($l_{ij} = 1$) given from the previous step. We show in figure 2f that indeed we obtain the predicted surfaces (the student has learned).\n\nReferences:\n\nE. B. Baum and Y. Lyuu. 1991. The transition to perfect generalization in perceptrons, Neural Computation, vol. 3, no. 3, pp. 386-401.\n\nA. Blake and A. Zisserman. 1987. Visual Reconstruction, MIT Press, Cambridge, Mass.\n\nD. Geiger and F. Girosi. 1989. Coupled Markov random fields and mean field theory, Advances in Neural Information Processing Systems 2, Morgan Kaufmann, D. Touretzky.\n\nD. Geiger and A. Yuille. 1991. A common framework for image segmentation, Int. Jour. Comp. Vis., vol. 6:3, pp. 227-243.\n\nD. Geiger and F. Girosi. 1991. Parallel and deterministic algorithms for MRFs: surface reconstruction, PAMI, May 1991, vol. PAMI-13, 5, pp. 401-412.\n\nD. Geiger and R. M. Pereira. 1991. The outlier process, IEEE Workshop on Neural Networks for Signal Processing, Princeton, NJ.\n\nS. Geman and D. Geman. 1984. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, PAMI, vol. PAMI-6, pp. 721-741.\n\nJ. J. Hopfield. 1984. 
Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci., 79, pp. 2554-2558.\n\nP. J. Huber. 1981. Robust Statistics, John Wiley and Sons, New York.\n\nD. Mumford and J. Shah. 1985. Boundary detection by minimizing functionals, I, Proc. IEEE Conf. on Computer Vision & Pattern Recognition, San Francisco, CA.\n\nT. Poggio and F. Girosi. 1990. Regularization algorithms for learning that are equivalent to multilayer networks, Science, vol. 247, pp. 978-982.\n\nD. E. Rumelhart, G. Hinton and R. J. Williams. 1986. Learning internal representations by error backpropagation. Nature, 323, 533.\n\nFigure 1: (a) 8-bit image of $128 \\times 128$ pixels. (b) The edge map for $\\mu = 1.0$, $\\gamma_{ij} = 100.0$, after 200 iterations and final $\\beta = 25 \\approx \\infty$. (c) The approximated image for $\\mu = 0.01$, $\\gamma_{ij} = 1.0$ and $\\eta = 0.0009$. (d) The corresponding sparse process. (e) The approximated image for $\\mu = 0.01$, $\\gamma_{ij} = 1.0$ and $\\eta = 0.0001$. (f) The corresponding sparse process.", "award": [], "sourceid": 559, "authors": [{"given_name": "Davi", "family_name": "Geiger", "institution": null}, {"given_name": "Ricardo", "family_name": "Pereira", "institution": null}]}