{"title": "An Information Theoretic Approach to Rule-Based Connectionist Expert Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 256, "page_last": 263, "abstract": null, "full_text": "AN INFORMATION THEORETIC APPROACH TO \nRULE-BASED CONNECTIONIST EXPERT SYSTEMS \n\nRodney M. Goodman, John W. Miller \nDepartment of Electrical Engineering \nCaltech 116-81 \nPasadena, CA 91125 \n\nPadhraic Smyth \nCommunication Systems Research \nJet Propulsion Laboratories 238-420 \n4800 Oak Grove Drive \nPasadena, CA 91109 \n\nAbstract \n\nWe discuss in this paper architectures for executing probabilistic rule-bases in a parallel \nmanner, using as a theoretical basis recently introduced information-theoretic \nmodels. We will begin by describing our (non-neural) learning algorithm and theory \nof quantitative rule modelling, followed by a discussion on the exact nature of two \nparticular models. Finally we work through an example of our approach, going from \ndatabase to rules to inference network, and compare the network's performance with \nthe theoretical limits for specific problems. \n\nIntroduction \n\nWith the advent of relatively cheap mass storage devices it is common in many \ndomains to maintain large databases or logs of data, e.g., in telecommunications, \nmedicine, finance, etc. The question naturally arises as to whether we can extract \nmodels from the data in an automated manner and use these models as the basis \nfor an autonomous rational agent in the given domain, i.e., automatically generate \n\"expert systems\" from data. There are really two aspects to this problem: firstly \nlearning a model and, secondly, performing inference using this model. What we \npropose in this paper is a rather novel and hybrid approach to learning and inference. 
Essentially we combine the qualitative knowledge representation ideas of \nAI with the distributed, computational advantages of connectionist models, using \nan underlying theoretical basis tied to information theory. The knowledge representation \nformalism we adopt is the rule-based representation, a scheme which is \nwell supported by cognitive scientists and AI researchers for modeling higher level \nsymbolic reasoning tasks. We have recently developed an information-theoretic algorithm \ncalled ITRULE which extracts an optimal set of probabilistic rules from a \ngiven data set [1, 2, 3]. It must be emphasised that we do not use any form of neural \nlearning such as backpropagation in our approach. To put it simply, the ITRULE \nlearning algorithm is far more computationally direct and better understood than \n(say) backpropagation for this particular learning task of finding the most informative \nindividual rules without reference to their collective properties. Performing \nuseful inference with this model or set of rules is quite a difficult problem. Exact \ntheoretical schemes such as maximum entropy (ME) are intractable for real-time \napplications. \n\nWe have been investigating schemes where the rules represent links on a directed \ngraph and the nodes correspond to propositions, i.e., variable-value pairs. Our \napproach is characterised by loosely connected, multiple path (arbitrary topology) \ngraph structures, with nodes performing local non-linear decisions as to their true \nstate based on both supporting evidence and their a priori bias. What we have in \nfact is a recurrent neural network. 
What is different about this approach compared \nto a standard connectionist model as learned by a weight-adaptation algorithm such \nas BP? The difference lies in the semantics of the representation [4]. Weights such \nas log-odds ratios based on log transformations of probabilities possess a clear meaning \nto the user, as indeed do the nodes themselves. This explicit representation of \nknowledge is a key requirement for any system which purports to perform reasoning, \nprobabilistic or otherwise. Conversely, the lack of explicit knowledge representation \nin most current connectionist approaches, i.e., the \"black box\" syndrome, is a major \nlimitation to their application in critical domains where user-confidence and \nexplanation facilities are key criteria for deployment in the field. \n\nLearning the model \n\nConsider that we have M observations or samples available, e.g., the number of \nitems in a database. Each sample datum is described in terms of N attributes \nor features, which can assume values in a corresponding set of N discrete alphabets. \nFor example our data might be described in the form of 10-component binary \nvectors. The requirement for discrete rather than continuous-valued attributes is \ndictated by the very nature of the rule-based representation. In addition it is important \nto note that we do not assume that the sample data is somehow exhaustive and \n\"correct.\" There is a tendency in both the neural network and AI learning literature \nto analyse learning in terms of learning a Boolean function from a truth table. The \nimplicit assumption is often made that given enough samples, and a good enough \nlearning algorithm we can always learn the function exactly. This is a fallacy, since \nit depends on the feature representation. 
For any problem of interest there are \nalways hidden causes with a consequent non-zero Bayes misclassification risk, i.e., \nthe function is dependent on non-observable features (unseen columns of the truth \ntable). Only in artificial problems such as game playing is \"perfect\" classification \npossible; in practical problems nature hides the real features. This phenomenon \nis well known in the statistical pattern recognition literature and renders invalid \nthose schemes which simply try to perfectly classify or memorise the training data. \n\nWe use the following simple model of a rule, i.e., \n\nIf Y = y then X = x with probability p \n\nwhere X and Y are two attributes (random variables) with \"x\" and \"y\" being values \nin their respective discrete alphabets. Given sample data as described earlier we \npose the problem as follows: can we find the \"best\" rules from a given data set, \nsay the K best rules? We will refer to this problem as that of generalised rule \ninduction, in order to distinguish it from the special case of deriving classification \nrules. Clearly we require both a preference measure to rank the rules and a learning \nalgorithm which uses the preference measure to find the K best rules. \n\nLet us define the information which the event y yields about the variable X, say \nf(X; y). Based on the requirements that f(X; y) is both non-negative and that \nits expectation with respect to Y equals the average mutual information I(X; Y), \nBlachman [5] showed that the only such function is the j-measure, which is defined \nas \n\n$$j(X; y) = p(x|y) \\log\\left(\\frac{p(x|y)}{p(x)}\\right) + p(\\bar{x}|y) \\log\\left(\\frac{p(\\bar{x}|y)}{p(\\bar{x})}\\right)$$ \n\nwhere $\\bar{x}$ denotes the event that X is not equal to x. More recently we have shown that j(X; y) possesses unique properties as a rule \ninformation measure [6]. 
\nIn general the j-measure is the average change in bits \nrequired to specify X between the a priori distribution (p(X)) and the a posteriori \ndistribution (p(X|y)). It can also be interpreted as a special case of the cross-entropy \nor binary discrimination (Kullback [7]) between these two distributions. We further \ndefine J(X; y) as the average information content where J(X; y) = p(y) \u00b7 j(X; y). \nJ(X; y) simply weights the instantaneous rule information j(X; y) by the probability \nthat the left-hand side will occur, i.e., that the rule will be fired. This definition \nis motivated by considerations of learning useful rules in a resource-constrained \nenvironment. A rule with high information content must be both a good predictor \nand have a reasonable probability of being fired, i.e., p(y) cannot be too small. \nInterestingly enough our definition of J(X; y) possesses a well-defined interpretation \nin terms of classical induction theory, trading off hypothesis simplicity with the \ngoodness-of-fit of the hypothesis to the data [8]. \n\nThe ITRULE algorithm [1, 2, 3] uses the J-measure to derive the most informative \nset of rules from an input data set. The algorithm produces a set of K probabilistic \nrules, ranked in order of decreasing information content. The parameter K may be \nuser-defined or determined via some statistical significance test based on the size of \nthe sample data set available. The algorithm searches the space of possible rules, \ntrading off generality of the rules with their predictiveness, and using information-theoretic \nbounds to constrain the search space. \n\nUsing the Model to Perform Inference \n\nHaving learned the model we now have at our disposal a set of lower order constraints \non the N-th order joint distribution in the form of probabilistic rules. This \nis our a priori model. 
In a typical inference situation we are given some initial \nconditions (i.e., some nodes are clamped), we are allowed to measure the state of \nsome other nodes (possibly at a cost), and we wish to infer the state or probability \nof one or more goal propositions or nodes from the available evidence. It is important \nto note that this is a much more difficult and general problem than classification of \na single, fixed, goal variable, since both the initial conditions and goal propositions \nmay vary considerably from one problem instance to the next. This is the inference \nproblem, determining an a posteriori distribution in the face of incomplete and \nuncertain information. The exact maximum entropy solution to this problem is intractable \nand, despite the elegance of the problem formulation, stochastic relaxation \ntechniques (Geman [9]) are at present impractical for real-time robust applications. \nOur motivation then is to perform an approximation to exact Bayesian inference \nin a robust manner. With this in mind we have developed two particular models \nwhich we describe as the hypothesis testing network and the uncertainty network. \n\nPrinciples of the Hypothesis Testing Network \n\nIn the first model under consideration each directed link from y to x is assigned a \nweight corresponding to the weight of evidence of y on x. This idea is not necessarily \nnew, although our interpretation and approach is different to previous work [10, 4]. \nHence we have \n\n$$w_{xy} = \\log\\frac{p(x|y)}{p(x)} - \\log\\frac{p(\\bar{x}|y)}{p(\\bar{x})} \\quad \\text{and} \\quad R_x = -\\log\\frac{p(x)}{p(\\bar{x})}$$ \n\nand the node x is assigned a threshold term corresponding to a priori bias. We use \na sigmoidal activation function, i.e., \n\n$$a(x) = \\frac{1}{1 + e^{-\\Delta E_x / T}}, \\quad \\text{where} \\quad \\Delta E_x = \\sum_{i=1}^{n} w_{xy_i} \\cdot q(y_i) - R_x$$ \n\nbased on multiple binary inputs $y_1 \\ldots y_n$ to x. Let S be the set of all $y_i$ which are \nhypothesised true (i.e., $a(y_i) = 1$), so that \n\n$$\\Delta E_x = \\log\\frac{p(x)}{p(\\bar{x})} + \\sum_{y_i \\in S} \\left( \\log\\frac{p(x|y_i)}{p(x)} - \\log\\frac{p(\\bar{x}|y_i)}{p(\\bar{x})} \\right)$$ \n\nIf each $y_i$ is conditionally independent given x then we can write \n\n$$\\frac{p(x|S)}{p(\\bar{x}|S)} = \\frac{p(x)}{p(\\bar{x})} \\prod_{y_i \\in S} \\frac{p(x|y_i)\\,p(\\bar{x})}{p(\\bar{x}|y_i)\\,p(x)}$$ \n\nTherefore the updating rule for conditionally independent $y_i$ is: \n\n$$T \\cdot \\log\\frac{a(x)}{1 - a(x)} = \\log\\frac{p(x|S)}{1 - p(x|S)}$$ \n\nHence a(x) > 1/2 iff p(x|S) > 1/2 and if T = 1, a(x) is exactly p(x|S). In terms of a \nhypothesis test, a(x) is chosen true iff: \n\n$$\\sum_{y_i \\in S} \\log\\frac{p(x|y_i)\\,p(\\bar{x})}{p(\\bar{x}|y_i)\\,p(x)} \\ge -\\log\\frac{p(x)}{p(\\bar{x})}$$ \n\nSince this describes the Neyman-Pearson decision region for independent measurements \n(evidence or $y_i$) with $R_x = -\\log\\frac{p(x)}{p(\\bar{x})}$ [11], this model can be interpreted as \na distributed form of hypothesis testing. \n\nPrinciples of the Uncertainty Network \n\nFor this model we defined the weight on a directed link from $y_i$ to x as \n\n$$w_{xy_i} = s_i \\cdot j(X; y_i) = s_i \\cdot \\left( p(x|y_i) \\log\\frac{p(x|y_i)}{p(x)} + p(\\bar{x}|y_i) \\log\\frac{p(\\bar{x}|y_i)}{p(\\bar{x})} \\right)$$ \n\nwhere $s_i = \\pm 1$ and the threshold is the same as the hypothesis model. We can \ninterpret $w_{xy_i}$ as the change in bits to specify the a posteriori distribution of x. If \n$p(x|y_i) > p(x)$, $w_{xy_i}$ has positive support for x, i.e., $s_i = +1$. If $p(x|y_i) < p(x)$, $w_{xy_i}$ \nhas negative support for x, i.e., $s_i = -1$. If we interpret the activation $a(y_i)$ as an \nestimator ($\\hat{p}(y_i)$) for $p(y_i)$, then for multiple inputs, \n\n$$\\Delta E_x = \\sum_{i} \\hat{p}(y_i) \\cdot s_i \\cdot \\left( p(x|y_i) \\log\\frac{p(x|y_i)}{p(x)} + p(\\bar{x}|y_i) \\log\\frac{p(\\bar{x}|y_i)}{p(\\bar{x})} \\right) - R_x$$ \n\nThis sum over input links weighted by activation functions can be interpreted as \nthe total directional change in bits required to specify x, as calculated locally by the \nnode x. One can normalise $\\Delta E_x$ to obtain an average change in bits by dividing by \na suitable temperature T. The node x can make a local decision by recovering p(x) \nfrom an inverse J-measure transformation of $\\Delta E_x$ (the sigmoid is an approximation \nto this inverse function). \n\nExperimental Results and Conclusions \n\nIn this section we show how rules can be generated from example data and automatically \nincorporated into a parallel inference network that takes the form of a \nmulti-layer neural network. The network can then be \"run\" to perform parallel \ninference. The domain we consider is that of a financial database of mutual funds, \nusing published statistical data [12]. The approach is, however, typical of many \ndifferent real world domains. \n\nFigure 1 shows a portion of a set of typical raw data on no-load mutual funds. \nEach line is an instance of a fund (with name omitted), and each column represents \nan attribute (or feature) of the fund. Attributes can be numerical or categorical. \nA typical categorical attribute is the fund type, which reflects the investment objectives \nof the fund (growth, growth and income, balanced, and aggressive growth), and \na typical numerical attribute is the five year return on investment expressed as a \npercentage. There are a total of 88 fund examples in this data set. From this raw \ndata a second quantized set of the 88 examples is produced to serve as the input to \nITRULE (Figure 2). In this example the attributes have been categorised to binary \nvalues so that they can be directly implemented as binary neurons. 
The ITRULE \nsoftware then processes this table to produce a set of rules. The rules are ranked \nin order of decreasing information according to the J-measure. Figure 3 shows a \nportion (the top ten rules) of the ITRULE output for the mutual fund data set. The \nhypothesis test log-likelihood metric h(X; y), the instantaneous j-measure j(X; y), \nand the average J-measure J(X; y), are all shown, together with the rule transition \nprobability p(x|y). \n\nIn order to perform inference with the ITRULE rules we need to map the rules \ninto a neural inference net. This is automatically done by ITRULE which generates \na network file that can be loaded into a neural network simulator. Thus rule \ninformation metrics become connection weights. Figure 4 shows a typical network \nderived from the ITRULE rule output for the mutual funds data. For clarity not \nall the connections are shown. The architecture consists of two layers of neurons \n(or \"units\"): an input layer and an output layer, both of which have an activation \nwithin the range {0,1}. There is one unit in the input layer (and a corresponding \nunit in the output layer) for each attribute in the mutual funds data. The output \nfeeds back to the input layer, and each layer is synchronously updated. The output \nunits can be considered to be the right hand sides of the rules and thus receive \ninputs from many rules, where the strength of the connection is the rule's metric. \nThe output units implement a sigmoid activation function on the sum of the inputs, \nand thus compute an activation which is an estimator of the right hand side \nposteriori attribute value. The input units simply pass this value on to the output \nlayer and thus have a linear activation. 
\n\nTo perform inference on the network, a probe vector of attribute values is loaded \ninto the input and output layers. Known values are clamped and cannot change \nwhile unknown or desired attribute values are free to change. The network then \nrelaxes and after several feedback cycles converges to a solution which can be read \noff the input or output units. To evaluate the models we set up four standard classification \ntests with varying numbers of nodes clamped as inputs. Unclamped nodes \nwere set to their a priori probability. After relaxing the network, the activation of \nthe \"target\" node was compared with the true attribute values for that sample in \norder to determine classification performance. The two models were each trained \non 10 randomly selected sets of 44 samples. The performance results given in Table \n1 are the average classification rate of the models on the other 44 unseen samples. \nThe Bayes risk (for a uniform loss matrix) of each classification test was calculated \nfrom the 88 samples. The actual performance of the networks occasionally exceeded \nthis value due to small sample variations on the 44/44 cross validations. \n\nTable 1 \n\nUnits Clamped   Uncertainty Test   Hypothesis Test   1 - Bayes' Risk \n9               66.8%              70.4%             88.6% \n5               70.1%              70.1%             80.6% \n2               48.2%              63.0%             63.6% \n1               51.4%              65.7%             64.8% \n\nWe conclude from the performance of the networks as classifiers that they have \nindeed learned a model of the data using a rule-based representation. The hypothesis \nnetwork performs slightly better than the uncertainty model, with both being \nquite close to the estimated optimal rate (1 - Bayes' risk). 
Given that we know \nthat the independence assumptions in both models do not hold exactly, we coin the \nterm robust inference to describe this kind of accurate behaviour in the presence of \nincomplete and uncertain information. Based on these encouraging initial results, \nour current research is focusing on higher-order rule networks and extending our \ntheoretical understanding of models of this nature. \n\nAcknowledgments \n\nThis work is supported in part by a grant from Pacific Bell, and by Caltech's \nprogram in Advanced Technologies sponsored by Aerojet General, General Motors \nand TRW. Part of the research described in this paper was carried out by the Jet \nPropulsion Laboratory, California Institute of Technology, under a contract with \nthe National Aeronautics and Space Administration. John Miller is supported by \nNSF grant no. ENG-8711673. \n\nReferences \n\n1. R. M. Goodman and P. Smyth, 'An information theoretic model for rule-based \nexpert systems,' presented at the 1988 International Symposium on Information \nTheory, Kobe, Japan. \n\n2. R. M. Goodman and P. Smyth, 'Information theoretic rule induction,' Proceedings \nof the 1988 European Conference on AI, Pitman Publishing: London. \n\n3. R. M. Goodman and P. Smyth, 'Deriving rules from databases: the ITRULE \nalgorithm,' submitted for publication. \n\n4. H. Geffner and J. Pearl, 'On the probabilistic semantics of connectionist networks,' \nProceedings of the 1987 IEEE ICNN, vol. II, pp. 187-195. \n\n5. N. M. Blachman, 'The amount of information that y gives about X,' IEEE \nTransactions on Information Theory, vol. IT-14 (1), 27-31, 1968. \n\n6. P. Smyth and R. M. Goodman, 'The information content of a probabilistic \nrule,' submitted for publication. \n\n7. S. 
Kullback, Information Theory and Statistics, New York: Wiley, 1959. \n\n8. D. Angluin and C. Smith, 'Inductive inference: theory and methods,' ACM \nComputing Surveys, 15(9), pp. 237-270, 1984. \n\n9. S. Geman, 'Stochastic relaxation methods for image restoration and expert systems,' \nin Maximum Entropy and Bayesian Methods in Science and Engineering \n(Vol. 2), 265-311, Kluwer Academic Publishers, 1988. \n\n10. G. Hinton and T. Sejnowski, 'Optimal perceptual inference,' Proceedings of the \nIEEE CVPR 1989. \n\n11. R. E. Blahut, Principles and Practice of Information Theory, Addison-Wesley: \nReading, MA, 1987. \n\n12. American Association of Investors, The individual investor's guide to no-load \nmutual funds, International Publishing Corporation: Chicago, 1987. \n\n[Figure 1. Raw Mutual Funds Data. Table of sample funds with columns: Fund Type, 5 Year Return %, Diversity, Beta (Risk), Bull Perf., Bear Perf., Stocks %, Investment Incm. $, Net Asset Value $, Distributions (%NAV), Expense Ratio %, Turnover Rate %, Total Assets $M.] \n\n[Figure 2. Quantized Mutual Funds Data. Table of the same funds with binary attributes: Type A, Type B, Type G, Type GI, 5 Year Return % (above or below S&P = 138%), Beta (under 1 or over 1), Stocks >90% (yes or no), Turnover (threshold 100%), Assets (threshold $100M), Distributions (threshold 15% NAV), Diversity, Bull Perf. and Bear Perf. (grades C,D,E versus A,B).] \n\nFigure 3. Top Ten Mutual Funds Rules \n\nITRULE rule output: Mutual Funds \n\nRule                                          p(x|y)  j(X;y)  J(X;y)  h(X;y) \n 1 IF 5yrRet>S&P above THEN Bull_perf high    0.97    0.75    0.235    4.74 \n 2 IF Bull_perf low THEN 5yrRet>S&P below     0.98    0.41    0.201    4.31 \n 3 IF Assets large THEN Bull_perf high        0.81    0.28    0.127    2.02 \n 4 IF Bull_perf high THEN 5yrRet>S&P above    0.40    0.25    0.127   -1.71 \n 5 IF typeA yes THEN typeG no                 0.04    0.50    0.123   -3.87 \n 6 IF Bull_perf low THEN Assets small         0.18    0.25    0.121   -1.95 \n 7 IF typeGI yes THEN typeG no                0.05    0.49    0.109   -3.74 \n 8 IF Bull_perf high THEN Assets large        0.72    0.21    0.109    1.64 \n 9 IF typeG yes THEN typeA no                 0.97    0.27    0.108    3.54 \n10 IF Assets small THEN Bull_perf low         0.26    0.19    0.103   -1.57 \n\n[Figure 4. Rule Network. An input layer of linear units (one unit per attribute) is connected through metric connection weights to an output layer of sigmoid units; feedback connections from the output layer to the input layer have weight = 1.] \n", "award": [], "sourceid": 150, "authors": [{"given_name": "Rodney", "family_name": "Goodman", "institution": null}, {"given_name": "John", "family_name": "Miller", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}