{"title": "Rule Induction through Integrated Symbolic and Subsymbolic Processing", "book": "Advances in Neural Information Processing Systems", "page_first": 969, "page_last": 976, "abstract": null, "full_text": "Rule Induction through Integrated Symbolic and Subsymbolic Processing\n\nClayton McMillan, Michael C. Mozer, Paul Smolensky\n\nDepartment of Computer Science and Institute of Cognitive Science\nUniversity of Colorado\nBoulder, CO 80309-0430\n\nAbstract\n\nWe describe a neural network, called RuleNet, that learns explicit, symbolic condition-action rules in a formal string manipulation domain. RuleNet discovers functional categories over elements of the domain, and, at various points during learning, extracts rules that operate on these categories. The rules are then injected back into RuleNet and training continues, in a process called iterative projection. By incorporating rules in this way, RuleNet exhibits enhanced learning and generalization performance over alternative neural net approaches. By integrating symbolic rule learning and subsymbolic category learning, RuleNet has capabilities that go beyond a purely symbolic system. We show how this architecture can be applied to the problem of case-role assignment in natural language processing, yielding a novel rule-based solution.\n\n1 INTRODUCTION\nWe believe that neural networks are capable of more than pattern recognition; they can also perform higher cognitive tasks which are fundamentally rule-governed. Further, we believe that they can perform higher cognitive tasks better if they incorporate rules rather than eliminate them. A number of well-known cognitive models, particularly of language, have been criticized for going too far in eliminating rules in fundamentally rule-governed domains. 
We argue that with a suitable choice of high-level, rule-governed task, representation, processing architecture, and learning algorithm, neural networks can represent and learn rules involving higher-level categories while simultaneously learning those categories. The resulting networks can exhibit better learning and task performance than neural networks that do not incorporate rules, and have capabilities that go beyond those of a purely symbolic rule-learning algorithm.\n\nWe describe an architecture, called RuleNet, which induces symbolic condition-action rules in a string mapping domain. In the following sections we describe this domain, the task and network architecture, simulations that demonstrate the potential for this approach, and finally, future directions of the research leading toward more general and complex domains.\n\n2 DOMAIN\nWe are interested in domains that map input strings to output strings. A string consists of n slots, each containing a symbol. For example, the string abcd contains the symbol c in slot 3. The domains we have studied are intrinsically rule-based, meaning that the mapping function from input to output strings can be completely characterized by explicit, mutually exclusive condition-action rules. These rules are of the general form \"if certain symbols are present in the input then perform a certain mapping from the input slots to the output slots.\" The conditions do not operate directly on the input symbols, but rather on categories defined over the input symbols. Input symbols can belong to multiple categories. For example, the words boy and girl are instances of the higher level category HUMAN. 
We denote instances with lowercase bold font, and categories with uppercase bold font. It should be apparent from context whether a letter string refers to a single instance, such as boy, or a string of instances, such as abcd.\n\nThree types of conditions are allowed: 1) a simple condition, which states that an instance of some category must be present in a particular slot of the input string, 2) a conjunction of two simple conditions, and 3) a disjunction of two simple conditions. A typical condition might be that an instance of the category W must be present in slot 1 of the input string and an instance of category Y must be present in slot 3.\n\nThe action performed by a rule produces an output string in which the content of each slot is either a fixed symbol or a function of a particular input slot, with the additional constraint that each input slot maps to at most one output slot. In the present work, this function of the input slots is the identity function. A typical action might be to switch the symbols in slots 1 and 2 of the input, replace slot 3 with the symbol a, and copy slot 4 of the input to the output string unchanged, e.g., abcd → baad.\n\nWe call rules of this general form second-order categorical permutation (SCP) rules. The number of rules grows exponentially with the length of the strings and the number of input symbols. An example of an SCP rule for strings of length four is:\n\nif (input_1 is an instance of W and input_3 is an instance of Y) then\n(output_1 = input_2, output_2 = input_1, output_3 = a, output_4 = input_4)\n\nwhere input_α and output_β denote input slot α and output slot β, respectively. As a shorthand for this rule, we write [∧ W_Y_ → 21a4], where the square brackets indicate this is a rule, the \"∧\" denotes a conjunctive condition, and the \"_\" denotes a wildcard symbol. 
A disjunction is denoted by \"∨\".\n\nThis formal string manipulation task can be viewed as an abstraction of several interesting cognitive models in the connectionist literature, including case-role assignment (McClelland & Kawamoto, 1986), grapheme-phoneme mapping (Sejnowski & Rosenberg, 1987), and mapping verb stems to the past tense (Rumelhart & McClelland, 1986).\n\n[Figure 1: The RuleNet Architecture. Labeled layers: input, n pools of u hidden units, n pools of v category units, m condition units. Legend: single unit, layer of units, complete connectivity, gating connection.]\n\n3 TASK\nRuleNet's task is to induce a compact set of rules that accurately characterizes a set of training examples. We generate training examples using a predefined rule base. The rules are over strings of length four and alphabets which are subsets of {a, b, c, d, e, f, g, h, i, j, k, l}. For example, the rule [∨ Y_W_ → 4h21] may be used to generate the exemplars:\n\nhedk → kheh, cldk → khlc, gbdj → jhbg, gdbk → khdg\n\nwhere category W consists of a, b, c, d, i, and category Y consists of f, g, h. Such exemplars form the corpus used to train RuleNet. Exemplars whose input strings meet the conditions of several rules are excluded. RuleNet's task is twofold: It must discover the categories solely based upon the usage of their instances, and it must induce rules based upon those categories.\n\nThe rule bases used to generate examples are minimal in the sense that no smaller set of rules could have produced the examples. Therefore, in our simulations the target number of rules to be induced is the same as the number used to generate the training corpus. 
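To make the Section 3 example concrete, the following is a minimal sketch of how one SCP rule could be applied to a string. It is an illustration only, not the authors' generator: the names CATEGORIES, condition_holds, apply_action and apply_rule are hypothetical, and the rule encoding (a kind, a per-slot category pattern, and an action string) is an assumption made for this sketch. The categories are the ones given in the text, with the five-instance category written W here.

```python
# Hypothetical encoding of an SCP rule: (kind, pattern, action).
# kind    : 'or' or 'and' (a simple condition is an 'and' over one slot)
# pattern : one entry per slot, a category name or None for a wildcard
# action  : one character per output slot; a digit k copies input slot k,
#           any other character is written as a fixed symbol
CATEGORIES = {'W': set('abcdi'), 'Y': set('fgh')}

def condition_holds(kind, pattern, s):
    # Evaluate each non-wildcard slot test against the input string.
    tests = [s[i] in CATEGORIES[c] for i, c in enumerate(pattern) if c is not None]
    return any(tests) if kind == 'or' else all(tests)

def apply_action(action, s):
    # Slots are 1-based in the paper, so digit k reads s[k - 1].
    return ''.join(s[int(a) - 1] if a.isdigit() else a for a in action)

def apply_rule(rule, s):
    kind, pattern, action = rule
    if condition_holds(kind, pattern, s):
        return apply_action(action, s)
    return None  # the rule does not apply to this input

# The disjunctive example rule: slot 1 in Y or slot 3 in W, action 4h21.
rule = ('or', ['Y', None, 'W', None], '4h21')
for s in ['hedk', 'cldk', 'gbdj', 'gdbk']:
    print(s, '->', apply_rule(rule, s))
```

A training corpus in the sense of the text would additionally discard any input that satisfies the conditions of more than one rule; that filtering step is left out of the sketch.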
\n\nThere are several traditional, symbolic systems, e.g., COBWEB (Fisher, 1987), that induce rules for classifying inputs based upon training examples. It seems likely that, given the correct representation, a system such as COBWEB could learn rules that would classify patterns in our domain. However, it is not clear whether such a system could also learn the action associated with each class. Classifier systems (Booker et al., 1989) learn both conditions and actions, but there is no obvious way to map a symbol in slot α of the input to slot β of the output. We have also devised a greedy combinatoric algorithm for inducing this type of rule, which has a number of shortcomings in comparison to RuleNet. See McMillan (1992) for comparisons of RuleNet and alternative symbolic approaches.\n\n4 ARCHITECTURE\nRuleNet can implement SCP rules of the type outlined above. As shown in Figure 1, RuleNet has five layers of units: an input layer, an output layer, a layer of category units, a layer of condition units, and a layer of hidden units. The operation of RuleNet can be divided into three functional components: categorization is performed in the mapping from the input layer to the category layer via the hidden units, the conditions are evaluated in the mapping from the category layer to the condition layer, and actions are performed in the mapping from the input layer to the output layer, gated by the condition units.\n\nThe input layer is divided into n pools of units, one for each slot, and activates the category layer, which is also divided into n pools. Input pool α maps to category pool α. Units in category pool α represent possible categorizations of the symbol in input slot α. One or more category units will respond to each input symbol. 
The activation of the hidden and category units is computed with a logistic squashing function. There are m units in the condition layer, one per rule. The activation of condition unit i, p_i, is computed as follows:\n\n$$p_i = \\frac{\\mathrm{logistic}(net_i)}{\\sum_j \\mathrm{logistic}(net_j)}$$\n\nwhere net_i is the net input to condition unit i. The activation p_i represents the probability that rule i applies to the current input. The normalization enforces a soft winner-take-all competition among condition units. To the degree that a condition unit wins, it enables a set of weights from the input layer to the output layer. These weights correspond to the action for a particular rule. There is one set of weights, A_i, for each of the m rules. The activation of the output layer, y, is calculated from the input layer, x, as follows:\n\n$$y = \\sum_i p_i A_i x$$\n\nEssentially, the transformation A_i for each rule i is applied to the input, and it contributes to the output to the degree that condition i is satisfied. Ideally, just one condition unit will be fully activated by a given input, and the rest will remain inactive.\n\nThis architecture is based on the local expert architecture of Jacobs, Jordan, Nowlan, and Hinton (1991), but is independently motivated in our work by the demands of the task domain. RuleNet has essentially the same structure as the Jacobs network, where the action substructure of RuleNet corresponds to their local experts and the condition substructure corresponds to their gating network. However, their goal, to minimize crosstalk between logically independent subtasks, is quite different from ours.\n\n4.1 Weight Templates\nIn order to interpret the weights in RuleNet as symbolic SCP rules, it is necessary to establish a correspondence between regions of weight space and SCP rules. 
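Before the templates are defined, it may help to see the gated forward computation of Section 4 in miniature. The following is a minimal sketch under two stated assumptions: the condition activations are logistic values normalised to sum to one, and the output is the p-weighted sum of the rule actions applied to the input, which is how the surrounding prose describes it. The function names and the tiny two-rule example are hypothetical and chosen only for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def condition_activations(nets):
    # Normalised logistic: a soft winner-take-all over the m condition units.
    raw = [logistic(z) for z in nets]
    total = sum(raw)
    return [r / total for r in raw]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gated_output(p, actions, x):
    # Assumed form: y = sum_i p_i * (A_i x), so each rule's action
    # contributes in proportion to its condition activation p_i.
    y = [0.0] * len(actions[0])
    for p_i, A_i in zip(p, actions):
        for k, value in enumerate(matvec(A_i, x)):
            y[k] += p_i * value
    return y

# Two toy rules over a two-element input: copy it, or swap its elements.
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
p = condition_activations([10.0, -10.0])  # rule 1 strongly favoured
y = gated_output(p, [identity, swap], [1.0, 0.0])
print(p, y)
```

When one net input dominates, the output is very close to that rule's action alone, which is the ideal case described in the text where a single condition unit is fully active.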
\n\nA weight template is a parameterized set of constraints on some weights (a manifold in weight space) that has a direct correspondence to an SCP rule. The strategy behind iterative projection is twofold: constrain gradient descent so that weights stay close to templates in weight space, and periodically project the learned weights to the nearest template, which can then readily be interpreted as a set of SCP rules.\n\nFor SCP rules, there are three types of weight templates: one dealing with categorization, one with rule conditions, and one with rule actions. Each type of template is defined over a subset of the weights in RuleNet. The categorization templates are defined over the weights from input to category units, the condition templates are defined over the weights from category to condition units for each rule i, c_i, and the action templates are defined over the weights from input to output units for each rule i, A_i.\n\nCategory templates. The category templates specify that the mapping from each input slot α to category pool α, for 1 ≤ α ≤ n, is uniform. This imposes category invariance across the input string.\n\nCondition templates. The weight vector c_i, which maps category activities to the activity of condition unit i, has vn elements, v being the number of category units per slot and n being the number of slots. The fact that the condition unit should respond to at most one category in each slot implies that at most one weight in each v-element subvector of c_i should be nonzero. For example, assuming there are three categories, W, X, and Y, the vector c_i that detects the simple condition \"input_2 is an instance of X\" is: (000 0φ0 000 000), where φ is an arbitrary parameter. 
Additionally, a bias is required to ensure that the net input will be negative unless the condition is satisfied. Here, a bias value, b, of -0.5φ will suffice. For disjunctive and conjunctive conditions, weights in two slots should be equal to φ, the rest zero, and the appropriate bias is -0.5φ or -1.5φ, respectively. There is a weight template for each condition type and each combination of slots that takes part in a condition. We generalize these templates further in a variety of ways. For instance, in the case where each input symbol falls into exactly one category, if a constant ε_α is added to all weights of c_i corresponding to slot α and ε_α is also subtracted from b, the net input to condition unit i will be unaffected. Thus, the weight template must include the {ε_α}.\n\nAction templates. If we wish the actions carried out by the network to correspond to the string manipulations allowed by our rule domain, it is necessary to impose some restrictions on the values assigned to the action weights for rule i, A_i. A_i has an n x n block form, where n is the length of input/output strings. Each block is a k x k submatrix, where k is the number of elements in the representation of each input symbol. The block at block-row β, block-column α of A_i copies input_α to output_β if it is the identity matrix. Thus, the weight templates restrict each block to being either the identity matrix or the zero matrix. If output_β is to be a fixed symbol, then block-row β must be all zero except for the output bias weights in block-row β.\n\nThe weight templates are defined over a submatrix A_iβ, the set of weights mapping the input to an output slot β. There are n+1 templates, one for the mapping of each input slot to the output, and one for the writing of a fixed symbol to the output. 
An additional constraint that only one block may be nonzero in block-column α of A_i ensures that input_α maps to at most one output slot.\n\n4.2 Constraints on Weight Changes\nRecall that the strategy in iterative projection is to constrain weights to be close to the templates described above, in order that they may be readily interpreted as symbolic rules. We use a combination of hard and soft constraints, some of which we briefly describe here.\n\nTo ensure that during learning every block in A_i approaches the identity or zero matrix, we constrain the off-diagonal terms to be zero and constrain weights along the diagonal of each block to be the same, thus limiting the degrees of freedom to one parameter within each block. All weights in c_i except the bias are constrained to positive or zero values. Two soft constraints are imposed upon the network to encourage all-or-none categorization of input instances: A decay term is used on all weights in c_i except the maximum in each slot, and a second cost term encourages binary activation of the category units.\n\n4.3 Projection\nThe constraints described above do not guarantee that learning will produce weights that correspond exactly to SCP rules. However, using projection, it is possible to transform the condition and action weights such that the resulting network can be interpreted as rules. The essential idea of projection is to take a set of learned weights, such as c_i, and compute values for the parameters in each of the corresponding weight templates such that the resulting weights match the learned weights. The weight template parameters are estimated using a least squares procedure, and the closest template, based upon a Euclidean distance metric, is taken to be the projected weights. 
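The projection step for condition weights can be sketched as follows. This is a simplified illustration, not the procedure actually used in the paper: it assumes each template is a one-parameter family (weights equal to φ in the chosen positions, bias of -0.5φ for simple and disjunctive conditions and -1.5φ for conjunctive ones), it ignores the per-slot ε offsets mentioned in Section 4.1, and it discards fits with non-positive φ. The names templates and project are hypothetical.

```python
from itertools import combinations

def templates(n_slots, v):
    # Yield (label, direction) pairs. direction holds the weight pattern
    # followed by the bias entry, both per unit of the free parameter phi.
    size = n_slots * v
    for k in range(size):
        t = [0.0] * size
        t[k] = 1.0
        yield ('simple', (k,)), t + [-0.5]
    for k, l in combinations(range(size), 2):
        if k // v == l // v:
            continue  # the two simple conditions must sit in different slots
        t = [0.0] * size
        t[k] = t[l] = 1.0
        yield ('or', (k, l)), t + [-0.5]
        yield ('and', (k, l)), t + [-1.5]

def project(c, b, n_slots, v):
    # Least-squares fit of phi for every template, then keep the template
    # with the smallest Euclidean distance to the learned weights.
    w = list(c) + [b]
    best = None
    for label, t in templates(n_slots, v):
        phi = sum(wi * ti for wi, ti in zip(w, t)) / sum(ti * ti for ti in t)
        if phi <= 0:
            continue
        dist = sum((wi - phi * ti) ** 2 for wi, ti in zip(w, t)) ** 0.5
        if best is None or dist < best[0]:
            best = (dist, label, phi)
    return best

# Noisy weights that are close to a simple condition on slot 2, category 2.
c = [0.1, 0.0, 0.0, 0.0, 2.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05, 0.0]
print(project(c, -1.0, n_slots=4, v=3))
```

The enumeration is exhaustive, which is affordable here because the number of templates grows only quadratically in the number of category units; a larger network would presumably need a more selective search.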
\n\n5 SIMULATIONS\nWe ran simulations on 14 different training sets, averaging the performance of the network over at least five runs with different initial weights for each set. The training data were generated from SCP rule bases containing 2-8 rules and strings of length four. Between four and eight categories were used. Alphabets ranged from eight to 12 symbols. Symbols were represented by either local or distributed activity vectors. Training set sizes ranged from 3-15% of possible examples.\n\nIterative projection involved the following steps: (1) start with one rule (one set of c_i-A_i weights), (2) perform gradient descent for 500-5,000 epochs, (3) project to the nearest set of SCP rules and add a new rule. Steps (2) and (3) were repeated until the training set was fully covered.\n\nIn virtually every run on each data set in which RuleNet converged to a set of rules that completely covered the training set, the rules extracted were exactly the original rules used to generate the training set. In the few remaining runs, RuleNet discovered an equivalent set of rules.\n\nIt is instructive to examine the evolution of a rule set. The rightmost column of Figure 2 shows a set of five rules over four categories, used to generate 200 exemplars, and the left portion of the figure shows the evolution of the hypothesis set of rules learned by RuleNet over 20,000 training epochs, projecting every 4,000 epochs. At epoch 8,000, RuleNet has discovered two rules over two categories, covering 24.5% of the training set. At epoch 12,000, RuleNet has discovered three rules over three categories, covering 52% of the training set. At epoch 20,000, RuleNet has induced five rules over four categories that cover 100% of the training examples. A close comparison of these rules with the original rules shows that they only differ in the arbitrary labels RuleNet has attached to the categories.\n\nFigure 2: Evolution of a Rule Set\n\nRules\nepoch 8,000: [∨ B_C_ → 4h21], [∧ _B_C → 341f]\nepoch 12,000: [∨ B_C_ → 4h21], [∧ _EC → 2413], [∧ _B_B → 321f]\nepoch 20,000: [∨ B_C_ → 4h21], [ _B_ → 4213], [∨ _E_D → 342f], [∧ _D_B → 3214], [∨ _EC → 2413]\noriginal rules: [∨ Y_W_ → 4h21], [ _Y_ → 4213], [∨ _Z_X → 342f], [∧ _X_Y → 3214], [∨ _ZW → 2413]\n\nCategories (category: instances)\nepoch 8,000: B: f g h; C: a b c i\nepoch 12,000: B: f g h; C: a b c d i; E: a i j k\nepoch 20,000: B: f g h; C: a b c d i; D: e g l; E: a c i j k\noriginal categories: W: a b c d i; X: e g l; Y: f g h; Z: a c i j k\n\nTable 1: Generalization performance of RuleNet (average of five runs), as % of patterns correctly mapped\n\nArchitecture | Data Set 1 (8 Rules) train / test | Data Set 2 (3 Rules) train / test | Data Set 3 (3 Rules) train / test | Data Set 4 (5 Rules) train / test\nRuleNet | 100 / 100 | 100 / 100 | 100 / 100 | 100 / 100\nJacobs architecture | 100 / 22 | 100 / 7 | 100 / 14 | 100 / 27\n3-layer backprop | 100 / 27 | 100 / 7 | 100 / 14 | 100 / 35\n# of patterns in set | 120 / 1635 | 45 / 1380 | 45 / 1380 | 75 / 1995\n\nLearning rules can greatly enhance generalization. In cases where RuleNet learns the original rules, it can be expected to generalize perfectly to any pattern created by those rules. We compared the performance of RuleNet to that of a standard three-layer backprop network (with 15 hidden units per rule) and a version of the Jacobs architecture, which in principle has the capacity to perform the task. 
Four rule bases were tested, and roughly 5% of the possible examples were used for training and the remainder were used for generalization testing. Outputs were thresholded to 0 or 1. The cleaned up outputs were compared to the targets to determine which were mapped correctly. All three systems learn the training set perfectly. However, on the test set, RuleNet's ability to generalize is 300% to 2000% better than the other systems (Table 1).\n\nFinally, we applied RuleNet to case-role assignment, as considered by McClelland and Kawamoto (1986). Case-role assignment is the problem of mapping syntactic constituents of a sentence to underlying semantic, or thematic, roles. For example, in the sentence, \"The boy broke the window\", boy is the subject at the syntactic level and the agent, or acting entity, at the semantic level. Window is the object at the syntactic level and the patient, or entity being acted upon, at the semantic level. The words of a sentence can be represented as a string of n slots, where each slot is labeled with a constituent, such as subject, and that slot is filled with the corresponding word, such as boy. The output is handled analogously. We used McClelland and Kawamoto's 152 sentences over 34 nouns and verbs as RuleNet's training set. The five categories and six rules induced by RuleNet are shown in Table 2, where S = subject, O = object, and wNP = noun in the with noun-phrase. 
We conjecture that RuleNet has induced such a small set of rules in part because it employs implicit conflict resolution, automatically assigning strengths to categories and conditions. These rules cover 97% of the training set and perform the correct case-role assignments on 84% of the 1307 sentences in the test set.\n\nTable 2: SCP Rules Induced by RuleNet in Case-Role Assignment\n\nRule | Sample of Sentences Handled Correctly\nif O = VICTIM then wNP-modifier | The boy ate the pasta with cheese.\nif O = THING ∧ wNP = UTENSIL then wNP-instrument | The boy ate the pasta with the fork.\nif S = BREAKER then S-instrument | The rock broke the window.\nif S = THING then S-patient | The window broke. The fork moved.\nif V = moved then self-patient | The man moved.\nif S = ANIMATE then food-patient | The lion ate.\n\n6 DISCUSSION\nRuleNet is but one example of a general methodology for rule induction in neural networks. This methodology involves five steps: 1) identify a fundamentally rule-governed domain, 2) identify a class of rules that characterizes that domain, 3) design a general architecture, 4) establish a correspondence between components of symbolic rules and manifolds of weight space (weight templates), and 5) devise a weight-template-based learning procedure.\n\nUsing this methodology, we have shown that RuleNet is able to perform both category and rule learning. Category learning strikes us as an intrinsically subsymbolic process. Functional categories are often fairly arbitrary (consider the classification of words as nouns or verbs) or have complex statistical structure (consider the classes \"liberals\" and \"conservatives\"). 
Consequently, real-world categories can seldom be described in terms of boolean (symbolic) expressions; subsymbolic representations are more appropriate.\n\nWhile category learning is intrinsically subsymbolic, rule learning is intrinsically a symbolic process. The integration of the two is what makes RuleNet a unique and powerful system. Traditional symbolic machine learning approaches aren't well equipped to deal with subsymbolic learning, and connectionist approaches aren't well equipped to deal with the symbolic. RuleNet combines the strengths of each approach.\n\nAcknowledgments\nThis research was supported by NSF Presidential Young Investigator award IRI-9058450, grant 90-21 from the James S. McDonnell Foundation, and DEC external research grant 1250 to MM; NSF grants IRI-8609599 and ECE-8617947 to PS; by a grant from the Sloan Foundation's computational neuroscience program to PS; and by the Optical Connectionist Machine Program of the NSF Engineering Research Center for Optoelectronic Computing Systems at the University of Colorado at Boulder.\n\nReferences\nBooker, L.B., Goldberg, D.E., & Holland, J.H. (1989). Classifier systems and genetic algorithms. Artificial Intelligence, 40:235-282.\nFisher, D.H. (1987). Knowledge acquisition via incremental concept clustering. Machine Learning, 2:139-172.\nJacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79-87.\nMcClelland, J. & Kawamoto, A. (1986). Mechanisms of sentence processing: assigning roles to constituents. In J.L. McClelland, D.E. Rumelhart, & the PDP Research Group, Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol. 2. Cambridge, MA: MIT Press/Bradford Books.\nMcMillan, C. (1992). 
Rule induction in a neural network through integrated symbolic and subsymbolic processing. Unpublished Ph.D. Thesis. Boulder, CO: Department of Computer Science, University of Colorado.\nRumelhart, D., & McClelland, J. (1986). On learning the past tense of English verbs. In J.L. McClelland, D.E. Rumelhart, & the PDP Research Group, Parallel Distributed Processing: Explorations in the microstructure of cognition, Vol. 2. Cambridge, MA: MIT Press/Bradford Books.\nSejnowski, T. J. & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168.\n", "award": [], "sourceid": 520, "authors": [{"given_name": "Clayton", "family_name": "McMillan", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Paul", "family_name": "Smolensky", "institution": null}]}