{"title": "Information Factorization in Connectionist Models of Perception", "book": "Advances in Neural Information Processing Systems", "page_first": 45, "page_last": 51, "abstract": null, "full_text": "Information Factorization in Connectionist Models of Perception\n\nJavier R. Movellan\nDepartment of Cognitive Science\nInstitute for Neural Computation\nUniversity of California San Diego\n\nJames L. McClelland\nCenter for the Neural Bases of Cognition\nDepartment of Psychology\nCarnegie Mellon University\n\nAbstract\n\nWe examine a psychophysical law that describes the influence of stimulus and context on perception. According to this law, choice probability ratios factorize into components independently controlled by stimulus and context. It has been argued that this pattern of results is incompatible with feedback models of perception. In this paper we examine this claim using neural network models defined via stochastic differential equations. We show that the law is related to a condition named channel separability and has little to do with the existence of feedback connections. In essence, channels are separable if they converge into the response units without direct lateral connections to other channels and if their sensors are not directly contaminated by external inputs to the other channels. Implications of the analysis for cognitive and computational neuroscience are discussed.\n\n1 Introduction\n\nWe examine a psychophysical law, named the Morton-Massaro law, and its implications for connectionist models of perception and neural information processing. As an example of the type of experiment covered by the Morton-Massaro law, consider an experiment by Massaro and Cohen (1983) in which subjects had to identify synthetic consonant sounds presented in the context of other phonemes. 
There were two response alternatives, seven stimulus conditions, and four context conditions. The response alternatives were /l/ and /r/; the stimuli were synthetic sounds generated by varying the onset frequency of the third formant, followed by the vowel /i/. Each of the 7 stimuli was placed after each of four different context consonants, /v/, /s/, /p/, and /t/. Morton (1969) and Massaro independently showed that in a remarkable range of experiments of this type, the influence of stimulus and context on response probabilities can be accounted for with a factorized version of Luce's strength model (Luce, 1959)\n\n$$P(R = k \\mid S = i, C = j) = \\frac{\\eta_s(i,k)\\,\\eta_c(j,k)}{\\sum_{l} \\eta_s(i,l)\\,\\eta_c(j,l)}, \\quad \\text{for } (i,j,k) \\in \\mathcal{S} \\times \\mathcal{C} \\times \\mathcal{R}. \\tag{1}$$\n\nHere $S$, $C$ and $R$ are random variables representing the stimulus, context and the subject's response; $\\mathcal{S}$, $\\mathcal{C}$ and $\\mathcal{R}$ are the sets of stimulus, context and response alternatives; $\\eta_s(i,k) > 0$ represents the support of stimulus $i$ for response $k$, and $\\eta_c(j,k) > 0$ the support of context $j$ for response $k$. Assuming no strength parameter is exactly zero, (1) is equivalent to\n\n$$\\frac{P(R = k \\mid S = i, C = j)}{P(R = l \\mid S = i, C = j)} = \\left(\\frac{\\eta_s(i,k)}{\\eta_s(i,l)}\\right) \\left(\\frac{\\eta_c(j,k)}{\\eta_c(j,l)}\\right), \\quad \\text{for all } (i,j,k) \\in \\mathcal{S} \\times \\mathcal{C} \\times \\mathcal{R}. \\tag{2}$$\n\nThis says that response probability ratios factorize into two components, one which is affected by the stimulus but unaffected by the context and one affected by the context but unaffected by the stimulus.\n\n2 Diffusion Models of Perception\n\nMassaro (1989) conjectured that the Morton-Massaro law may be incompatible with feedback models of perception. 
This conjecture was based on the idea that in networks with feedback connections the stimulus can have an effect on the context units and the context can have an effect on the stimulus units, making it impossible to factorize the influence of information sources. In this paper we analyze this conjecture and show that, surprisingly, the Morton-Massaro law has little to do with the existence of feedback and lateral connections. We ground our analysis on continuous stochastic versions of recurrent neural networks\u00b9. We call these models diffusion (neural) networks, for they are stochastic diffusion processes defined by adding Brownian motion to the standard recurrent neural network dynamics. Diffusion networks are defined by the following stochastic differential equation\n\n$$dY_i(t) = \\mu_i(Y(t), X)\\,dt + \\sigma\\,dB_i(t), \\quad \\text{for } i \\in \\{1, \\ldots, n\\}, \\tag{3}$$\n\nwhere $Y_i(t)$ is a random variable representing the internal potential at time $t$ of the $i$th unit, $Y(t) = (Y_1(t), \\ldots, Y_n(t))'$, $X$ represents the external input, which consists of stimulus and context, and $B_i$ is Brownian motion, which acts as a stochastic driving term. The constant $\\sigma > 0$, known as the dispersion, controls the amount of noise injected into each unit. The function $\\mu_i$, known as the drift, determines the average instantaneous change of activation and is borrowed from the standard recurrent neural network literature: this change is modulated by a matrix $w$ of connections between units, and a matrix $v$ that controls the influence of the external inputs on each unit:\n\n$$\\mu_i(Y(t), X) = \\frac{1}{\\kappa_i(Y_i(t))} \\left(\\bar{Y}_i(t) - Y_i(t)\\right), \\quad \\text{for all } i \\in \\{1, \\ldots, n\\}, \\tag{4}$$\n\nwhere $1/\\kappa_i$ is a positive function, named the capacitance, controlling the speed of processing and\n\n$$\\bar{Y}_i(t) = \\sum_j w_{i,j}\\,Z_j(t) + \\sum_k v_{i,k}\\,X_k, \\quad \\text{for all } i \\in \\{1, \\ldots, n\\}, \\tag{5}$$\n\n$$Z_j(t) = \\varphi_j(Y_j(t)) = \\varphi(\\alpha_j Y_j(t)) = \\frac{1}{1 + e^{-\\alpha_j Y_j(t)}}. \\tag{6}$$\n\nHere $w_{i,j}$, an element of the connection matrix $w$, is the weight from unit $j$ to unit $i$, $v_{i,k}$ is an element of the matrix $v$, $\\varphi$ is the logistic activation function and the $\\alpha_i > 0$ terms are gain parameters that control the sharpness of the activation functions. For large values of $\\alpha_i$ the activation function of unit $i$ converges to a step function. The variable $Z_j(t)$ represents a short-time mean firing rate (the activation) of unit $j$ scaled in the (0,1) range. Intuition for equation (3) can be gained by thinking of it as the limit of a discrete-time difference equation; in such a case\n\n$$Y_i(t + \\Delta t) = Y_i(t) + \\mu_i(Y(t), X)\\,\\Delta t + \\sigma \\sqrt{\\Delta t}\\,N_i(t), \\tag{7}$$\n\nwhere the $N_i(t)$ are independent standard Gaussian random variables. For a fixed state at time $t$ there are two forces controlling the change in activation: the drift, which is deterministic, and the dispersion, which is stochastic. This results in a distribution of states at time $t + \\Delta t$. As $\\Delta t$ goes to zero, the solution to the difference equation (7) converges to the diffusion process defined in (3). In this paper we focus on the behavior of diffusion networks at stochastic equilibrium, i.e., we assume the network is given enough time to approximate stochastic equilibrium before its response is sampled.\n\n\u00b9 For an analysis grounded on discrete-time networks with binary states, see McClelland (1991).\n\n3 Channel Separability\n\nIn this section we show that the Morton-Massaro law is related to an architectural constraint named channel separability, which has nothing to do with the existence of feedback connections. 
In order to define channel separability it is useful to characterize the function of different units using the following categories: 1) Response specification units: A unit is a response specification unit if, when the state of all the other units in the network is fixed, changing the state of this unit affects the probability distribution of overt responses. 2) Stimulus units: A unit belongs to the stimulus channel if: a) it is not a response unit, and b) when the state of the response units is fixed, the probability distribution of the activations of this unit is affected by the stimulus. 3) Context units: A unit belongs to the context channel if: a) it is not a response unit, and b) when the states of the response units are fixed, the probability distribution of the activations of this unit can be affected by the context. Given the above definitions, we say that a network has separable stimulus and context channels if the stimulus and context units are disjoint: no unit simultaneously belongs to the stimulus and context channels. In essence, channels are structurally separable if they converge into the response units without direct lateral connections to other channels and if their sensors are not directly contaminated by external inputs to the other channels (see Figure 1).\n\nIn the rest of the paper we show that if a diffusion network is structurally separable, the Morton-Massaro law can be approximated with arbitrary precision regardless of the existence of feedback connections. For simplicity we examine the case in which the weight matrix is symmetric. In such a case, each state has an associated goodness function that greatly simplifies the analysis. In a later section we discuss how the results generalize to the non-symmetric case.\n\nLet $y \\in \\mathbb{R}^n$ represent the internal potential of a diffusion network. 
Let $z_i = \\varphi(\\alpha_i y_i)$ for $i = 1, \\ldots, n$ represent the firing rates corresponding to $y$. Let $z^s$, $z^c$ and $z^r$ represent the components of $z$ for the units in the stimulus channel, context channel and response specification module. Let $x$ be a vector representing an input and let $x^s$, $x^c$ be the components of $x$ for the external stimulus and context. Let $\\alpha = (\\alpha_1, \\ldots, \\alpha_n)$ be a fixed gain vector and $Z^{\\alpha}(t)$ a random vector representing the firing rates at time $t$ of a network with gain vector $\\alpha$. Let $Z^{\\alpha} = \\lim_{t \\to \\infty} Z^{\\alpha}(t)$ represent the firing rates at stochastic equilibrium. In Movellan (1998) it is shown that if the weights are symmetric, i.e., $w = w'$, and $1/\\kappa_i(x) = d\\varphi_i(x)/dx$, then the equilibrium probability density of $Z^{\\alpha}$ is as follows\n\n$$p_{Z^{\\alpha} \\mid X}(z^s, z^c, z^r \\mid x^s, x^c) = \\frac{1}{K_{\\alpha}(x^s, x^c)} \\exp\\left(\\frac{2}{\\sigma^2}\\,G_{\\alpha}(z^s, z^c, z^r \\mid x^s, x^c)\\right), \\tag{8}$$\n\nwhere\n\n$$K_{\\alpha}(x^s, x^c) = \\int \\exp\\left(\\frac{2}{\\sigma^2}\\,G_{\\alpha}(z \\mid x^s, x^c)\\right) dz, \\tag{9}$$\n\n$$G_{\\alpha}(z \\mid x) = H(z \\mid x) - \\sum_{i=1}^{n} S_{\\alpha_i}(z_i), \\tag{10}$$\n\n$$H(z \\mid x) = z' w z / 2 + z' v x, \\tag{11}$$\n\n$$S_{\\alpha_i}(z_i) = \\alpha_i \\left(\\log(z_i) + \\log(1 - z_i)\\right) + \\frac{1}{\\alpha_i} \\left(z_i \\log(z_i) + (1 - z_i) \\log(1 - z_i)\\right). \\tag{12}$$\n\nFigure 1: A network with separable context and stimulus processing channels. The stimulus sensor and stimulus relay units make up the stimulus channel units, and the context sensor and context relay units make up the context channel units. Note that any of the modules can be empty except the response module.\n\nWithout loss of generality, hereafter we set $\\sigma^2 = 2$. When there are no direct connections between the stimulus and context units, there are no terms in the goodness function in which $x^s$ or $z^s$ occur jointly with $x^c$ or $z^c$. 
Because of this, the goodness can be separated into three additive terms: one that depends on $x^s$, one that depends on $x^c$, and a third that depends only on the response units:\n\n$$G_{\\alpha}(z^s, z^c, z^r \\mid x^s, x^c) = G^s_{\\alpha}(z^s, z^r \\mid x^s) + G^c_{\\alpha}(z^c, z^r \\mid x^c) + G^r_{\\alpha}(z^r), \\tag{13}$$\n\nwhere\n\n$$G^s_{\\alpha}(z^s, z^r \\mid x^s) = (z^s)' w_{s,s} z^s / 2 + (z^s)' w_{s,r} z^r + (z^s)' v_{s,s} x^s + (z^r)' v_{r,s} x^s - \\sum_i S_{\\alpha_i}(z^s_i), \\tag{14}$$\n\n$$G^c_{\\alpha}(z^c, z^r \\mid x^c) = (z^c)' w_{c,c} z^c / 2 + (z^c)' w_{c,r} z^r + (z^c)' v_{c,c} x^c + (z^r)' v_{r,c} x^c - \\sum_i S_{\\alpha_i}(z^c_i), \\tag{15}$$\n\n$$G^r_{\\alpha}(z^r) = (z^r)' w_{r,r} z^r / 2 - \\sum_i S_{\\alpha_i}(z^r_i). \\tag{16}$$\n\nHere $w_{s,r}$ is the submatrix of $w$ connecting the stimulus and response units. Similar notation is used for the other submatrices of $w$ and $v$. It follows that we can write the ratio of the joint probability densities of two states $z$ and $\\tilde{z}$ as follows:\n\n$$\\frac{p_{Z^{\\alpha} \\mid X}(z^s, z^c, z^r \\mid x^s, x^c)}{p_{Z^{\\alpha} \\mid X}(\\tilde{z}^s, \\tilde{z}^c, \\tilde{z}^r \\mid x^s, x^c)} = \\frac{\\exp\\left(G^s_{\\alpha}(z^s, z^r \\mid x^s) + G^c_{\\alpha}(z^c, z^r \\mid x^c) + G^r_{\\alpha}(z^r)\\right)}{\\exp\\left(G^s_{\\alpha}(\\tilde{z}^s, \\tilde{z}^r \\mid x^s) + G^c_{\\alpha}(\\tilde{z}^c, \\tilde{z}^r \\mid x^c) + G^r_{\\alpha}(\\tilde{z}^r)\\right)}, \\tag{17}$$\n\nwhich factorizes as desired. To get probability densities for the response units, we integrate over the states of all the other units\n\n$$p_{Z^r_{\\alpha} \\mid X}(z^r \\mid x^s, x^c) = \\int \\int p_{Z^{\\alpha} \\mid X}(z^s, z^c, z^r \\mid x^s, x^c)\\,dz^s\\,dz^c, \\tag{18}$$\n\nand after rearranging terms\n\n$$p_{Z^r_{\\alpha} \\mid X}(z^r \\mid x^s, x^c) = \\frac{1}{K_{\\alpha}(x^s, x^c)} \\left(\\int \\exp\\left(G^s_{\\alpha}(z^s, z^r \\mid x^s) + G^r_{\\alpha}(z^r)\\right) dz^s\\right) \\left(\\int \\exp\\left(G^c_{\\alpha}(z^c, z^r \\mid x^c)\\right) dz^c\\right), \\tag{19}$$\n\nwhich also factorizes. All that is left is to map the continuous states of the response units to discrete external responses. To do so we partition the space of the response specification units into discrete regions. The probability of a response becomes the integral of the probability density over the region corresponding to that response. The problem is that the integral of probability densities does not necessarily factorize even though the densities factorize at every point.\n\nFortunately there are two important cases for which the law holds, at least as a good approximation. 
The first case is when the response regions are small, so that we can approximate the integral over a region by the density at a point times the volume of the region. In such a case the ratio of the integrals can be approximated by the ratio of the probability densities of those individual states. The second case applies to models, like McClelland and Rumelhart's (1981) interactive activation model, in which each response is associated with a distinct response unit. These models typically have negative connections amongst the response units so that at equilibrium one unit tends to be active while the others are inactive. In such a case a common response policy picks the response corresponding to the active unit. We now show that such a policy can approximate the Morton-Massaro law to an arbitrary level of precision as the gain parameter of the response units is increased.\n\nLet $z$ represent the joint state of a network and let the first $r$ components of $z$ be the states of the response specification units. Let $z^{(1)} = (1, 0, 0, \\ldots, 0)'$, $z^{(2)} = (0, 1, 0, \\ldots, 0)'$ be two $r$-dimensional vectors representing states of the response specification units. For $i \\in \\{1, 2\\}$ and $\\Delta \\in (0, 1)$ let\n\n$$z^{(i)}_{\\Delta} = (1 - z^{(i)})\\,\\Delta + z^{(i)}\\,(1 - \\Delta), \\tag{20}$$\n\n$$R^{(i)}_{\\Delta} = \\{x \\in \\mathbb{R}^r : x_j \\in ((1 - \\Delta) z^{(i)}_j,\\, \\Delta + (1 - \\Delta) z^{(i)}_j), \\text{ for } j = 1, \\ldots, r\\}. \\tag{21}$$\n\nThe sets $R^{(1)}_{\\Delta}$ and $R^{(2)}_{\\Delta}$ are regions of the $[0,1]^r$ space mapping into two distinct external responses. We now investigate the convergence of the probability ratio of these two responses as we let $\\Delta \\to 0$, i.e., as the response regions collapse into corners of $[0,1]^r$:\n\n$$\\lim_{\\Delta \\to 0} \\frac{P(Z^r_{\\alpha} \\in R^{(2)}_{\\Delta} \\mid X = x)}{P(Z^r_{\\alpha} \\in R^{(1)}_{\\Delta} \\mid X = x)} = \\lim_{\\Delta \\to 0} \\frac{\\int_{R^{(2)}_{\\Delta}} p_{Z^r_{\\alpha} \\mid X}(u \\mid x)\\,du}{\\int_{R^{(1)}_{\\Delta}} p_{Z^r_{\\alpha} \\mid X}(u \\mid x)\\,du} \\tag{22}$$\n\n$$= \\lim_{\\Delta \\to 0} \\frac{\\Delta^r\\,p_{Z^r_{\\alpha} \\mid X}(z^{(2)}_{\\Delta} \\mid x)}{\\Delta^r\\,p_{Z^r_{\\alpha} \\mid X}(z^{(1)}_{\\Delta} \\mid x)} = \\lim_{\\Delta \\to 0} \\frac{\\int \\int e^{G_{\\alpha}(z^{(2)}_{\\Delta}, z^s, z^c \\mid x)}\\,dz^s\\,dz^c}{\\int \\int e^{G_{\\alpha}(z^{(1)}_{\\Delta}, z^s, z^c \\mid x)}\\,dz^s\\,dz^c}. \\tag{23}$$\n\nTable 1: Predictions by the Morton-Massaro law (left value in each cell) versus the diffusion network (in square brackets) for subject 7 of Massaro and Cohen (1983), Experiment 2. Each prediction of the diffusion network is based on 100 random samples. Columns are the context conditions.\n\nStimulus   V                S                P                T\n0          0.0017 [0.01]    0.0000 [0.00]    0.0152 [0.03]    0.9000 [0.91]\n1          0.0126 [0.00]    0.0000 [0.00]    0.1008 [0.10]    0.9849 [0.97]\n2          0.1105 [0.19]    0.0008 [0.00]    0.5208 [0.45]    0.9984 [1.00]\n3          0.5463 [0.54]    0.0079 [0.00]    0.9133 [0.91]    0.9998 [1.00]\n4          0.9827 [1.00]    0.2756 [0.30]    0.9980 [1.00]    0.9999 [1.00]\n5          0.9999 [1.00]    0.9924 [0.99]    0.9999 [1.00]    1.0000 [1.00]\n6          0.9999 [1.00]    0.9924 [1.00]    0.9999 [1.00]    1.0000 [1.00]\n\nNow note that\n\n$$G_{\\alpha}(z^{(1)}_{\\Delta}, z^s, z^c \\mid x) = H(z^{(1)}_{\\Delta}, z^s, z^c \\mid x) - \\sum_{i=1}^{r} S_{\\alpha_i}(z^{(1)}_{\\Delta,i}) - \\sum_i S_{\\alpha_i}(z^s_i) - \\sum_j S_{\\alpha_j}(z^c_j), \\tag{24}$$\n\nand since $\\sum_{i=1}^{r} S_{\\alpha_i}(z^{(1)}_{\\Delta,i}) = \\sum_{i=1}^{r} S_{\\alpha_i}(z^{(2)}_{\\Delta,i})$, it follows that\n\n$$\\lim_{\\Delta \\to 0} \\frac{P(Z^r_{\\alpha} \\in R^{(2)}_{\\Delta} \\mid X = x)}{P(Z^r_{\\alpha} \\in R^{(1)}_{\\Delta} \\mid X = x)} = \\lim_{\\Delta \\to 0} \\frac{\\int \\int e^{H(z^{(2)}_{\\Delta}, z^s, z^c \\mid x) - \\sum_i S_{\\alpha_i}(z^s_i) - \\sum_j S_{\\alpha_j}(z^c_j)}\\,dz^s\\,dz^c}{\\int \\int e^{H(z^{(1)}_{\\Delta}, z^s, z^c \\mid x) - \\sum_i S_{\\alpha_i}(z^s_i) - \\sum_j S_{\\alpha_j}(z^c_j)}\\,dz^s\\,dz^c}. \\tag{25}$$\n\nIt is easy to show that this ratio factorizes. Moreover, for all $\\Delta > 0$, if we let $\\alpha_1 = \\cdots = \\alpha_r = \\alpha$, where $\\alpha > 0$, then\n\n$$\\lim_{\\alpha \\to \\infty} P(Z^r_{\\alpha} \\in [\\Delta, 1 - \\Delta]^r) = 0, \\tag{26}$$\n\nsince as the gain of the response units increases, $S_{\\alpha_i}$ decreases very fast at the corners of $(0,1)^r$. 
Thus as $\\alpha \\to \\infty$ the random variable $Z^r_{\\alpha}$ converges in distribution to a discrete random variable with mass at the corners of the $[0,1]^r$ hypercube and with factorized probability ratios as expressed in (25). Since the indexing of the response units is arbitrary, the argument applies to all the responses. \u25a1\n\n4 Discussion\n\nOur analysis establishes that in diffusion networks the Morton-Massaro law is not incompatible with the presence of feedback and lateral connections. Surprisingly, even though in diffusion networks with feedback connections stimulus and context units are interdependent, it is still possible to factorize the effect of stimulus and context on response probabilities.\n\nThe analysis shows that the Morton-Massaro law can be approximated arbitrarily well as the sharpness of the response units is increased. In practice we have found very good approximations with relatively small values of the sharpness parameter (see Table 1 for an example). The analysis assumed that the weights were symmetric. Mathematical analysis of the general case with non-symmetric weights is difficult. However, useful approximations exist (Movellan & McClelland, 1995) showing that if the noise parameter $\\sigma$ is relatively small or if the activation function $\\varphi$ is approximately linear, symmetric weights are not needed to exhibit the Morton-Massaro law.\n\nThe analysis presented here has potential applications to investigating models of perception and the functional architecture of the brain. For example, the interactive activation model of word perception has a separable architecture and thus diffusion versions of it adhere to the Morton-Massaro law. The analysis also points to potential applications in computational neuroscience. 
It would be of interest to study whether the Morton-Massaro law holds at the level of neural responses. For example, we may excite a neuron with two different sources of information and observe its short-term average response to combinations of stimuli. If the observed distribution of responses exhibits the Morton-Massaro law, this would be consistent with the existence of separable channels converging into that neuron. Otherwise, it would indicate that the channels from the two input areas to the response may not be structurally separable.\n\nReferences\n\nLuce, R. D. (1959). Individual choice behavior. New York: Wiley.\nMassaro, D. W. (1989). Testing between the TRACE Model and the fuzzy logical model of speech perception. Cognitive Psychology, 21, 398-421.\nMassaro, D. W. (1998). Perceiving Talking Faces. Cambridge, Massachusetts: MIT Press.\nMassaro, D. W. & Cohen, M. M. (1983a). Phonological constraints in speech perception. Perception and Psychophysics, 34, 338-348.\nMcClelland, J. L. (1991). Stochastic interactive activation and the effect of context on perception. Cognitive Psychology, 23, 1-44.\nMorton, J. (1969). The interaction of information in word recognition. Psychological Review, 76, 165-178.\nMovellan, J. R. (1998). A Learning Theorem for Networks at Detailed Stochastic Equilibrium. Neural Computation, 10(5), 1157-1178.\nMovellan, J. R. & McClelland, J. L. (1995). Stochastic interactive processing, channel separability and optimal perceptual inference: an examination of Morton's law. Technical Report PDP.CNS.95A, available at http://cnbc.cmu.edu, Carnegie Mellon University. 
\n\n\f", "award": [], "sourceid": 1678, "authors": [{"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "James", "family_name": "McClelland", "institution": null}]}