{"title": "A Multiscale Attentional Framework for Relaxation Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 639, "abstract": "", "full_text": "A Multiscale Attentional Framework for Relaxation Neural Networks\n\nDimitris I. Tsioutsias\nDept. of Electrical Engineering\nYale University\nNew Haven, CT 06520-8285\ntsioutsias@cs.yale.edu\n\nEric Mjolsness\nDept. of Computer Science & Engineering\nUniversity of California, San Diego\nLa Jolla, CA 92093-0114\nemj@cs.ucsd.edu\n\nAbstract\n\nWe investigate the optimization of neural networks governed by general objective functions. Practical formulations of such objectives are notoriously difficult to solve; a common problem is the poor local extrema that result from any of the applied methods. In this paper, a novel framework is introduced for the solution of large-scale optimization problems. It assumes little about the objective function and can be applied to general nonlinear, non-convex functions; objectives in thousands of variables are thus efficiently minimized by a combination of techniques: deterministic annealing, multiscale optimization, attention mechanisms and trust region optimization methods.\n\n1 INTRODUCTION\n\nMany practical problems in computer vision, pattern recognition, robotics and other areas can be described in terms of constrained optimization. In the past decade, researchers have proposed means of solving such problems with the use of neural networks [Hopfield & Tank, 1985; Koch et al., 1986], which are thus derived as relaxation dynamics for the objective functions codifying the optimization task. 
\n\nOne disturbing aspect of the approach soon became obvious, namely the apparent inability of the methods to scale up to practical problems, the principal reason being the rapid increase in the number of local minima present in the objectives as the dimension of the problem increases. Moreover, most objectives, $E(v)$, are highly nonlinear, non-convex functions of $v$, and simple techniques (e.g. steepest descent) will, in general, locate the first minimum from the starting point.\n\nIn this work, we propose a framework for solving large-scale instances of such optimization problems. We discuss several techniques which assist in avoiding spurious minima and whose combined result is an objective function solution that is computationally efficient, while at the same time being globally convergent. In section 2.1 we discuss the use of deterministic annealing as a means of avoiding getting trapped in local minima. Section 2.2 describes multiscale representations of the original objective in reduced spatial domains. In section 2.3 we present a scheme for reducing the computational requirements of the optimization method used, by means of a focus of attention mechanism. Then, in section 2.4 we introduce a trust region method for the relaxation phase of the framework, which uses second order information (i.e. curvature) of the objective function. In section 3 we present experimental results on the application of our framework to a 2-D region segmentation objective with discontinuities. Finally, section 4 summarizes our presentation. 
\n\n2 THEORETICAL FRAMEWORK\n\nOur optimization framework takes the form of a list of nested loops indicating the order of conceptual (and computational) phases that occur: from the outer to the inner loop we make use of deterministic annealing, a multiscale representation, an attentional mechanism and a trust region optimization method.\n\n2.1 ANNEALING NETS\n\nThe usefulness of statistical mechanics for designing optimization procedures has recently been established; prime examples are simulated annealing and its various mean field theory approximations [Hopfield & Tank, 1985; Durbin & Willshaw, 1987]. The success of such methods is primarily due to entropic terms included in the objective (i.e. syntactic terms), but the price to pay is their highly nonlinear form. Interestingly, those terms can effectively be convexified by the use of a \"temperature\" parameter, $T$, allowing for a reduction in the number of minima and the ability to track the solution through \"temperature\".\n\n2.2 MULTISCALE REPRESENTATION\n\nTo solve large-scale problems in thousands of variables, we need to speed up the convergence of the method while still retaining valid state-space trajectories. To accomplish this we introduce smaller, approximate versions of the problem at coarser spatial scales [Mjolsness et al., 1991]; the nonlinearity of the original objective is maintained at all scales, as opposed to other approaches where the objectives and their derivatives are either approximated by the use of finite difference methods, or solved for by multigrid techniques where a quadratic objective is still assumed. 
\nConsequently, the multiscale representation exploits the effective smoothness in the objectives: by alternating relaxation phases between coarser and finer scales, we use the former to identify extrema and the latter to localise them.\n\n2.3 FOCUS OF ATTENTION\n\nTo further reduce the computational requirements of large-scale optimization (and indirectly control its temporal behavior), we use a focus of attention (FoA) mechanism [Mjolsness & Miranker, 1993], reminiscent of the spotlight hypothesis argued to exist in early vision systems [Koch & Ullman, 1985; Olshausen et al., 1993]. The effect of a FoA is to support efficient, responsive analysis: it allows resources to be focused on selected areas of a computation and can rapidly redirect them as the task requirements evolve.\n\nSpecifically, the FoA becomes a characteristic function, $\\pi(\\chi)$, determining which of the $N$ neurons are active and which are clamped during relaxation, by use of a discrete-valued vector, $\\chi$, and by the rule: $\\pi_i(\\chi) = 1$ if neuron $v_i$ is in the FoA, and zero otherwise. Moreover, a limited number, $n$, of neurons $v_i$ are active at any given instant: $\\sum_i \\pi_i(\\chi) = n$, with $n \\ll N$ and $n$ chosen as an optimal FoA size. To tie the attentional mechanism to the multiscale representation, we introduce a partition of the neurons $v_i$ into blocks indexed by $a$ (corresponding to coarse-scale block-neurons), via a sparse rectangular matrix $B_{ia} \\in \\{0, 1\\}$ such that $\\sum_a B_{ia} = 1$, $\\forall i$, with $i = 1, \\ldots, N$, $a = 1, \\ldots, K$ and $K \\ll N$. 
Then $\\pi_i(\\chi) = \\sum_a B_{ia} \\chi_a$, and we use each component of $\\chi$ for switching a different block of the partition; thus, a neuron $v_i$ is in the FoA iff its coarse scale block $a$ is in the FoA, as indicated by $\\chi_a$. As a result, our FoA need not necessarily have a single region of activity: it may well have a distributed activity pattern as determined by the partitions $B_{ia}$. 1\n\nClocked objective function notation [Mjolsness & Miranker, 1993] makes the task more apparent: during the active-$\\chi$ phase the FoA is computed for the next active-$v$ phase, determining the subset of neurons $v_i$ on which optimization is to be carried out. We introduce the quantity $E_{;i}[v] \\equiv \\partial E / \\partial \\tau_i$ ($\\tau_i$ is a time axis for $v_i$) [Mjolsness & Miranker, 1993] as an estimate of the predicted $dE$ arising from each $v_i$ if it joins the FoA. For Hopfield/Grossberg dynamics this measure becomes:\n\n$$E_{;i}[v] = -g_i'(g_i^{-1}(v_i)) \\left( \\frac{\\partial E}{\\partial v_i} \\right)^2 \\equiv -g_i'(u_i) (E_{,i})^2 \\qquad (1)$$\n\nwith $E_{,i} \\stackrel{\\mathrm{def}}{=} \\nabla_i E$, and $g_i$ the transfer function for neuron $v_i$ (e.g. a sigmoid function). Eq. (1) is used here analogously to saliency measures introduced into neurophysiological work [Koch & Ullman, 1985]; we propose it as a global measure of conspicuousness. As a result, attention becomes a k-winner-take-all (kWTA) network:\n\n[Eq. (2): not recoverable from the source text]\n\nwhere $l$ refers to the scale for which the FoA is being determined ($l = 1, \\ldots, L$), $\\oplus$ conforms with the clocked objective notation, and the last summand corresponds to the subspace on which optimization is to be performed, as determined by the current FoA. 2 Periodically, an analogous FoA through spatial scales is run, allowing re-direction of system resources to the scale which seems to be having the largest combined benefit and cost effect on the optimization [Tsioutsias & Mjolsness, 1995]. 
\nThe combined effect of multiscale optimization and FoA is depicted schematically in Fig. 1: reduced-dimension functionals are created and a FoA beam \"shines\" through scales picking the neurons to work on.\n\n1 Preferably, $B_{ia}$ will be chosen to minimize the number of inter-block connections.\n2 Before computing a new FoA we update the neighbors of all neurons that were included in the last focus; this has a similar effect to an implicit spreading of activation.\n\nFigure 1: Multiscale Attentional Neural Nets: FoA on a layer (e.g. L=1) competes with another FoA (e.g. L=2) to determine both preferable scale and subspace.\n\n2.4 OPTIMIZATION PHASE\n\nTo overcome the problems generally associated with the steepest descent method, other techniques have been devised. Newton's method, although successful in small to medium-sized problems, does not scale well in large non-convex instances and is computationally intensive. Quasi-Newton methods are efficient to compute, have quadratic termination but are not globally convergent for general nonlinear, non-convex functions. A method that guarantees global convergence is the trust region method [Conn et al., 1993]. The idea is summarized as follows: Newton's method suffers from non-positive definite Hessians; in such a case, the underlying function $m^{(k)}(\\delta)$ obtained from the 2nd order Taylor expansion of $E(v_k + \\delta)$ does not have a minimum and the method is not defined, or equivalently, the region around the current point $v_k$ in which the Taylor series is adequate does not include a minimizing point of $m^{(k)}(\\delta)$. 
To resolve this, we can define a neighborhood $\\Omega_k$ of $v_k$ such that $m^{(k)}(\\delta)$ agrees with $E(v_k + \\delta)$ in some sense; then, we pick $v_{k+1} = v_k + \\delta_k$, where $\\delta_k$ minimizes $m^{(k)}(\\delta)$, $\\forall (v_k + \\delta) \\in \\Omega_k$. Thus, we seek a solution to the resulting subproblem:\n\n$$\\min_{\\delta} m^{(k)}(\\delta) \\quad \\text{subject to} \\quad \\|\\delta\\|_p \\le \\Delta_k \\qquad (3)$$\n\nwhere $\\|\\cdot\\|_p$ is any kind of norm (for instance, the $L_2$ norm leads to the Levenberg-Marquardt methods), and $\\Delta_k$ is the radius of $\\Omega_k$, adaptively modified based on an accuracy ratio $r_k = \\Delta E^{(k)} / \\Delta m^{(k)} = (E^{(k)} - E(v_k + \\delta_k)) / (m^{(k)}(0) - m^{(k)}(\\delta_k))$; $\\Delta E^{(k)}$ is the \"actual reduction\" in $E^{(k)}$ when step $\\delta_k$ is taken, and $\\Delta m^{(k)}$ the \"predicted reduction\". The closer $r_k$ is to unity, the better the agreement between the local quadratic model of $E^{(k)}$ and the objective itself is, and $\\Delta_k$ is modified adaptively to reflect this [Conn et al., 1993].\n\nWe need to make some brief points here (a complete discussion will be given elsewhere [Tsioutsias & Mjolsness, 1995]):\n\n\u2022 At each spatial scale of our multiscale representation, we optimize the corresponding objective by applying a trust region method. To obtain sufficient relaxation progress as we move through scales we have to maintain meaningful region sizes, $\\Delta_k$; to that end we use a criterion based on the curvature of the functionals along a searching direction.\n\n\u2022 The dominant relaxation computation within the algorithm is the solution of eq. (3). We have chosen to solve this subproblem with a preconditioned conjugate gradient method (PCG) that uses a truncated Newton step to speed up the computation; steps are accepted when a sufficiently good approximation to the quasi-Newton step is found. 3 In our case, the norm in eq. 
(3) becomes the elliptical norm $\\|\\delta\\|_C = \\delta^t C \\delta$, where a diagonal preconditioner to the Hessian is used as the scaling matrix $C$.\n\n\u2022 If the neuronal connectivity pattern of the original objective is sparse (as happens for most practical combinatorial optimization problems), the pattern of the resulting Hessian can readily be represented by sparse static data structures, 4 as we have done within our framework. Moreover, the partition matrices, $B_{ia}$, introduce a moderate fill-in in the coarser objectives and the sparsity of the corresponding Hessians is again taken into account.\n\n3 EXPERIMENTS\n\nWe have applied our proposed optimization framework to a spatially structured objective from low-level vision, namely smooth 2-D region segmentation with the inclusion of discontinuity detection processes:\n\n[Eq. (4): not recoverable from the source text]\n\nwhere $d$ is the set of image intensities, $f$ is the real-valued smooth surface to be fit to the data, $l^v$ and $l^h$ are the discrete-valued line processes indicating a non-zero value in the intensity gradient, and $\\phi(x) = -(2g_0)^{-1}[\\ln x + \\ln(1-x)]$ is a barrier function restricting each variable into $(0,1)$ by infinite barriers at the borders. Eq. (4) is a mixed-nonlinear objective involving both continuous and binary variables; our framework optimizes vectors $f$, $l^h$ and $l^v$ simultaneously at any given scale as continuous variables, instead of earlier two-step, alternate continuous/discrete-phase approaches [Terzopoulos, 1986]. 
\n\nWe have tested our method on gradually increasing objectives, from a \"small\" size of N=12,288 variables for a 64x64 image, up to a large size of N=786,432 variables for a 512x512 image; the results seem to coincide with our theoretical expectations: a significant reduction in computational cost was observed and consistent convergence towards the optimum of the objective was found for various numbers of coarse scales and FoA sizes. The dimension of the objective at any scale $l$ was chosen via a power law: $N^{(L-l+1)/L}$, where $L$ is the total number of scales and $N$ the size of the original objective.\n\n3 The algorithm can also handle directions of negative curvature.\n4 This property becomes important in a neural net implementation.\n\nThe effect of our multiscale optimization with and without a FoA is shown in Fig. 2 for the 128x128 and the 512x512 nets, where $E(v^*)$ is the best final configuration with a one-level no-FoA net, and cumulative cost is an accumulated measure in the number of connection updates at each scale; a consistent scale-up in computational efficiency can be noted when $L > 1$, while the cost measure also reflects the relative total wall-clock times needed for convergence. Fig. 3 shows part of a comparative study we made for saliency measures alternative to eq. (1) (e.g. $g_i'|E_{,i}|$), in order to investigate the validity of eq. (1) as a predictor of $\\Delta E$: the more prominent \"linearity\" in the left scatterplot seems to justify our choice of saliency.\n\n
[Figure 2 panels: MS/AT Nets ($128^2$): L=1,2,3; MS/AT Nets ($512^2$): L=1,2,3,4]\n\nFigure 2: Multiscale Optimization (curves labeled by number of scales used): #-numbered curves correspond to nets without a FoA, simply-numbered ones to nets with a FoA used at all scales. The lowest costs result from the combined use of multiscale optimization and FoA.\n\n4 CONCLUSION\n\nWe have presented a framework for the optimization of large-scale objective functions using neural networks that incorporate a multiscale attentional mechanism. Our method allows for a continuous adaptation of the system resources to the computational requirements of the relaxation problem through the combined use of several techniques. The framework was applied to a 2-D image segmentation objective with discontinuities; formulations of this problem with tens to hundreds of thousands of variables were then successfully solved.\n\nAcknowledgements\n\nThis work was supported partly by AFOSR-F49620-92-J-0465 and the Yale Center of Theoretical and Applied Neuroscience.\n\n[Figure 3 panels: ($128^2$): Focus on 1st level - proposed saliency; ($128^2$): Focus on 1st level - absolute gradient; x-axis: Average Delta-E per block]\n\nFigure 3: Saliency Comparison: (left), saliency as in eq. (1); (right), the absolute gradient was used instead.\n\nReferences\n\nA. Conn, N. Gould, A. Sartenaer, & Ph. Toint. (1993) Global Convergence of a Class of Trust Region Algorithms for Optimization Using Inexact Projections on Convex Constraints. SIAM J. of Optimization, 3(1):164-221.\n\nR. Durbin & D. Willshaw. (1987) An Analogue Approach to the TSP Problem Using an Elastic Net Method. Nature, 326:689-691.\n\nJ. Hopfield & D. W. Tank. (1985) Neural Computation of Decisions in Optimization Problems. Biol. Cybernet., 52:141-152.\n\nC. Koch, J. Marroquin & A. Yuille. (1986) Analog 'Neuronal' Networks in Early Vision. Proc. of the National Academy of Sciences USA, 83:4263-4267.\n\nC. Koch & S. Ullman. (1985) Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry. Human Neurobiology, 4:219-227.\n\nE. Mjolsness, C. Garrett, & W. Miranker. (1991) Multiscale Optimization in Neural Nets. IEEE Trans. on Neural Networks, 2(2):263-274.\n\nE. Mjolsness & W. Miranker. (1993) Greedy Lagrangians for Neural Networks: Three Levels of Optimization in Relaxation Dynamics. YALEU/DCS/TR-945. (URL file://cs.ucsd.edu/pub/emj/papers/yale-TR-945.ps.Z)\n\nB. Olshausen, C. Anderson, & D. Van Essen. 
(1993) A Neurobiological Model of Visual Attention and Invariant Pattern Recognition Based on Dynamic Routing of Information. The Journal of Neuroscience, 13(11):4700-4719.\n\nD. Terzopoulos. (1986) Regularization of Inverse Visual Problems Involving Discontinuities. IEEE Trans. PAMI, 8:419-429.\n\nD. I. Tsioutsias & E. Mjolsness. (1995) Global Optimization in Neural Nets: A Novel Relaxation Framework. To appear as a UCSD-CSE-TR, Dec. 1995.\n", "award": [], "sourceid": 1022, "authors": [{"given_name": "Dimitris", "family_name": "Tsioutsias", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}]}