{"title": "Best-First Model Merging for Dynamic Learning and Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 958, "page_last": 965, "abstract": null, "full_text": "Best-First Model Merging for Dynamic Learning and Recognition \n\nStephen M. Omohundro \n\nInternational Computer Science Institute \n\n1947 Center Street, Suite 600 \nBerkeley, California 94704 \n\nAbstract \n\n\"Best-first model merging\" is a general technique for dynamically \nchoosing the structure of a neural or related architecture while avoiding \noverfitting. It is applicable to both learning and recognition tasks \nand often generalizes significantly better than fixed structures. We demonstrate \nthe approach applied to the tasks of choosing radial basis functions \nfor function learning, choosing local affine models for curve and \nconstraint surface modelling, and choosing the structure of a balltree or \nbumptree to maximize efficiency of access. \n\n1 TOWARD MORE COGNITIVE LEARNING \nStandard backpropagation neural networks learn in a way which appears to be quite different \nfrom human learning. Viewed as a cognitive system, a standard network always maintains \na complete model of its domain. This model is mostly wrong initially, but gets \ngradually better and better as data appears. The net deals with all data in much the same \nway and has no representation for the strength of evidence behind a certain conclusion. The \nnetwork architecture is usually chosen before any data is seen and the processing is much \nthe same in the early phases of learning as in the late phases. \nHuman and animal learning appears to proceed in quite a different manner. When an organism \nhas not had many experiences in a domain of importance to it, each individual experience \nis critical. 
Rather than use such an experience to slightly modify the parameters of a \nglobal model, a better strategy is to remember the experience in detail. Early in learning, an \norganism doesn't know which features of an experience are important unless it has strong \nprior knowledge of the domain. Without such prior knowledge, its best strategy is to generalize \non the basis of a similarity measure to individual stored experiences. (Shepard, 1987) \nshows that there is a universal exponentially decaying form for this kind of similarity based \ngeneralization over a wide variety of sensory domains in several studied species. As experiences \naccumulate, the organism eventually gets enough data to reliably validate models \nfrom complex classes. At this point the animal need no longer remember individual experiences, \nbut rather only the discovered generalities (e.g. as rules). With such a strategy, it is \npossible for a system to maintain a measure of confidence in its predictions while building \never more complex models of its environment. \nSystems based on these two types of learning have also appeared in the neural network, statistics \nand machine learning communities. In the learning literature one finds both \"table-lookup\" \nor \"memory-based\" methods and \"parameter-fitting\" methods. In statistics the distinction \nis made between \"non-parametric\" and \"parametric\" methods. Table-lookup methods \nwork by storing examples and generalize to new situations on the basis of similarity to \nthe old ones. Such methods are capable of one-shot learning and have a measure of the applicability \nof their knowledge to new situations but are limited in their generalization capability. 
Parameter fitting models choose the parameters of a predetermined model to best fit \na set of examples. They usually take longer to train and are susceptible to computational \ndifficulties such as local maxima but can potentially generalize better by extending the influence \nof examples over the whole space. Aside from computational difficulties, their fundamental \nproblem is overfitting, i.e. having insufficient data to validate a particular \nparameter setting as useful for generalization. \n\n2 OVERFITTING IN LEARNING AND RECOGNITION \nThere have been many recent results (e.g. based on the Vapnik-Chervonenkis dimension) \nwhich identify the number of examples needed to validate choices made from specific parametric \nmodel families. We would like a learning system to be able to induce extremely \ncomplex models of the world but we don't want to have to present it with the enormous \namount of data needed to validate such a model unless it is really needed. (Vapnik, 1982) \nproposes a technique for avoiding overfitting while allowing models of arbitrary complexity. \nThe idea is to start with a nested family of model spaces, whose members contain ever \nmore complex models. When the system has only a small amount of data it can only validate \nmodels in the smaller model classes. As more data arrives, however, the more complex \nclasses may be considered. If at any point a fit is found to within desired tolerances, \nhowever, only the amount of data required by the smallest class containing the chosen model \nis needed. Thus there is the potential for choosing complex models without penalizing situations \nin which the model is simple. The model merging approach may be viewed in these \nterms except that instead of a single nested family, there is a widely branching tree of model \nspaces. \nLike learning, recognition processes (visual, auditory, etc.) 
aim at constructing models \nfrom data. As such they are subject to the same considerations regarding overfitting. Figure \n1 shows a perceptual example where a simpler model (a single segment) is perceptually \nchosen to explain the data (4 almost collinear dots) over a more complex model (two segments) \nwhich fits the data better. An intuitive explanation is that if the dots were generated \nby two segments, it would be an amazing coincidence that they are almost collinear; if they \nwere generated by one, that fact is easily explained. Many of the Gestalt phenomena can be \nconsidered in the same terms. Many of the processes used in recognition (e.g. segmentation, \ngrouping) have direct analogs in learning and vice versa. \n\nFigure 1: An example of Occam's razor in recognition. \n\nThere has been much recent interest in the network community in Bayesian methods for \nmodel selection while avoiding overfitting (e.g. Buntine and Weigend, 1992 and MacKay \n1992). Learning and recognition fit naturally together in a Bayesian framework. The Bayesian \napproach makes explicit the need for a prior distribution. The posterior distribution \ngenerated by learning becomes the prior distribution for recognition. The model merging \nprocess described in this paper is applicable to both phases and the knowledge representation \nit suggests may be used for both processes as well. \nThere are at least three properties of the world that may be encoded in a prior distribution, \nthat have a dramatic effect on learning and recognition, and that are essential to the model merging \napproach. The continuity prior is that the world is geometric and unless there is contrary \ndata a system should prefer continuous models over discontinuous ones. This prior leads to \na wide variety of what may be called \"geometric learning algorithms\" 
(Omohundro, 1990). \nThe sparseness prior is that the world is sparsely interacting. This says that probable models \nnaturally decompose into components which only directly affect one another in a sparse \nmanner. The primary origin of this prior is that physical objects usually only directly affect \nnearby objects in space and time. This prior is responsible for the success of representations \nsuch as Markov random fields and Bayesian networks which encode conditional independence \nrelations. Even if the individual models consist of sparsely interacting components, \nit still might be that the data we receive for learning or recognition depends in an intricate \nway on all components. The locality prior prefers models in which the data decomposes \ninto components which are directly affected by only a small number of model components. \nFor example, in the learning setting only a small portion of the knowledge base will be relevant \nto any specific situation. In the recognition setting, an individual pixel is determined \nby only a small number of objects in the scene. In geometric settings, a localized representation \nallows only a small number of model parameters to affect any individual prediction. \n\n3 MODEL MERGING \nBased on the above considerations, an ideal learning or recognition system should model \nthe world using a collection of sparsely connected, smoothly parameterized, localized models. \nThis is an apt description of many of the neural network models currently in use. Bayesian \nmethods provide an optimal means for induction with such a choice of prior over \nmodels but are computationally intractable in complex situations. We would therefore like \nto develop heuristic approaches which approximate the Bayesian solution and avoid overfitting. 
Based on the idealization of animal learning in the first section, what we would like is a \nsystem which smoothly moves from a memory-based regime, in which the models are \nthe data, into ever more complex parameterized models. Because of the locality prior, model \ncomponents only affect a subset of the data. We can therefore choose the complexity of \ncomponents which are relevant to different portions of the data space according to the data \nwhich has been received there. This allows for reliably validated models of extremely high \ncomplexity in some regions of the space while other portions are modeled with low complexity. \nIf only a small number of examples have been seen in some region, these are simply \nremembered and generalization is based on similarity. As more data arrives, if regularities \nare found and there is enough data present to justify them, more complex parameterized \nmodels are incorporated. \nThere are many possible approaches to implementing such a strategy. We have investigated \na particular heuristic which can be made computationally efficient and appears to work well \nin a variety of areas. The best-first model merging approach is applicable in a variety of situations \nin which complex models are constructed by combining simple ones. The idea is to \nimprove a complex model by replacing two of its component models by a single model. \nThis \"merged\" model may be in the same family as the original components. More interestingly, \nbecause the combined data from the merged components is used in determining \nthe parameters of the merged model, it may come from a larger parameterized class. The \ncritical idea is to never allow the system to hypothesize a model which is more complex \nthan can be justified by the data it is based on. 
The \"best-first\" aspect is to always choose \nto merge the pair of models which decreases the likelihood of the data the least. The merging \nmay be stopped according to a variety of criteria which are now applied to individual model \ncomponents rather than the entire model. Examples of such criteria are those based on \ncross-validation, Bayesian Occam factors, VC bounds, etc. In experiments in a variety of \ndomains, this approach does an excellent job of discovering regularities and allocating \nmodelling resources efficiently. \n\n3 MODEL MERGING VS. K-MEANS FOR RBF'S \nOur first example is the problem of choosing centers in radial basis function networks for \napproximating functions. In the simplest approach, a radial basis function (e.g. a Gaussian) \nis located at each training input location. The induced function is a linear combination of \nthese basis functions which minimizes the mean square error of the training examples. Better \nmodels may be obtained by using fewer basis functions than data points. Most work on \nchoosing the centers of these functions uses a clustering technique such as k-means (e.g. \nMoody and Darken, 1989). This is reasonable because it puts the representational power of \nthe model in the regions of highest density where errors are more critical. It ignores the \nstructure of the modelled function, however. The model merging approach starts with a basis \nfunction at each training point and successively merges pairs which increase the training \nerror the least. We compared this approach with the k-means approach in a variety of circumstances. \nFigure 2 shows an example where the function on the plane to be learned is a sigmoid in x \ncentered at 0 and is constant in y. Thus the function varies most near the y axis (x = 0). The data \nis drawn from a Gaussian distribution which is centered at (-.5,0). 
21 training samples were \ndrawn from this distribution and from these a radial basis function network with 6 Gaussian \nbasis functions was learned. The X's in the figure show the centers chosen by k-means. As \nexpected, they are clustered near the center of the Gaussian source distribution. The triangles \nshow the centers chosen by best-first model merging. While there is some tendency to \nfocus on the source center, there is also a tendency to represent the region where the modelled \nfunction varies the most. The training error is over 10 times less with model merging \nand the test error on an independent test set is about 3 times lower. These results were typical \nin a variety of test runs. This simple example shows one way in which underlying structure \nis naturally discovered by the merging technique. \n\n[Figure 2 plot: x axis from -1.0 to 0.2, y axis from -1.5 to 1.5. Legend: dots are training points, triangles are model merging centers, x's are k-means centers. 21 samples, 6 centers, RBF width .4, Gaussian width .4, Gaussian center -.5, sigmoid width .4.] \n\nFigure 2: Radial basis function centers in two dimensions chosen by model \nmerging and by k-means. The dots show the 21 training samples. The x's are \nthe centers chosen by k-means, the triangles by model merging. 
The training \nerror was .008098 for k-means and .000604 for model merging. The test error \nwas .012463 for k-means and .004638 for model merging. \n\n4 APPROXIMATING CURVES AND SURFACES \nAs a second intuitive example, consider the problem of modelling a curve in the plane by \na combination of straight line segments. The error function may be taken as the mean \nsquare error over each curve point to the nearest segment point. A merging step in this case \nconsists of replacing two segments by a single segment. We always choose that pair such \nthat the merged segment increases the error the least. Figure 3 shows the approximations \ngenerated by this strategy. It does an excellent job at identifying the essentially linear portions \nof the curve and puts the boundaries between component models at the \"corners\". The \ncorresponding \"top-down\" approach would start with a single segment and repeatedly split \nit. This approach sometimes has to make decisions too early and often misses the corners \nin the curve. While not shown in the figure, as repeated mergings take place, more data is \navailable for each segment. This would allow us to use more complex models than linear \nsegments such as Bezier curves. It is possible to reliably induce a representation which is \nlinear in some portions and higher order in others. Such models potentially have many parameters \nand would be subject to overfitting if they were learned directly rather than by going \nthrough merge steps. \nExactly the same strategy may be applied to modelling higher-dimensional constraint surfaces \nby hyperplanes or functions by piecewise linear portions. The model merging approach \nnaturally complements the efficient mapping and constraint surface representations \ndescribed in (Omohundro, 1991) based on bumptrees. 
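
The segment-merging strategy described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes the curve is given as an ordered list of sample points, that each segment is the chord between two sample points, that the error is the summed squared distance from the covered points to that chord, and that merging stops once the total error would exceed a hypothetical max_error threshold. The names chord_error and best_first_merge are illustrative only.

```python
import heapq


def chord_error(points, a, b):
    # Summed squared distance from points[a+1 .. b-1] to the chord a -> b.
    (x0, y0), (x1, y1) = points[a], points[b]
    dx, dy = x1 - x0, y1 - y0
    length_sq = dx * dx + dy * dy
    total = 0.0
    for x, y in points[a + 1:b]:
        if length_sq == 0.0:
            t = 0.0
        else:
            # Projection parameter, clamped so we measure to the segment.
            t = ((x - x0) * dx + (y - y0) * dy) / length_sq
            t = max(0.0, min(1.0, t))
        px, py = x0 + t * dx, y0 + t * dy
        total += (x - px) ** 2 + (y - py) ** 2
    return total


def best_first_merge(points, max_error):
    # Start with one segment per pair of consecutive points and repeatedly
    # merge the neighbouring pair whose merge increases the error the least.
    n = len(points)
    nxt = {i: i + 1 for i in range(n - 1)}   # next breakpoint
    prv = {i: i - 1 for i in range(1, n)}    # previous breakpoint
    err = {i: 0.0 for i in range(n - 1)}     # error of segment starting at i
    total = 0.0
    heap = []

    def push(a):
        # Queue the merge of segments (a, m) and (m, b), if both exist.
        m = nxt.get(a)
        if m is None:
            return
        b = nxt.get(m)
        if b is None:
            return
        cost = chord_error(points, a, b) - err[a] - err[m]
        heapq.heappush(heap, (cost, a, m, b))

    for a in range(n - 2):
        push(a)

    while heap:
        cost, a, m, b = heapq.heappop(heap)
        # Lazy viability check: skip entries whose segments no longer exist.
        if nxt.get(a) != m or nxt.get(m) != b:
            continue
        if total + cost > max_error:
            break  # cheapest remaining merge violates the stopping criterion
        err[a] = err[a] + err[m] + cost
        total += cost
        nxt[a] = b
        prv[b] = a
        del nxt[m], prv[m], err[m]
        push(a)              # new merge with the right-hand neighbour
        if a in prv:
            push(prv[a])     # new merge with the left-hand neighbour

    breakpoints = [0]
    while breakpoints[-1] in nxt:
        breakpoints.append(nxt[breakpoints[-1]])
    return breakpoints


# An L-shaped curve: a horizontal run followed by a vertical run.
curve = [(float(i), 0.0) for i in range(6)] + [(5.0, float(j)) for j in range(1, 6)]
print(best_first_merge(curve, 1e-9))
```

With a near-zero threshold the sketch keeps the corner of the L-shaped curve as its only interior breakpoint. Stale queue entries are checked when they reach the top of the heap rather than being removed eagerly.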
\n\n[Figure 3 panels: Error=1, Error=2, Error=5, Error=10, Error=20] \n\nFigure 3: Approximation of a curve by best-first merging of segment models. The top row \nshows the endpoints chosen by the algorithm at various levels of allowed error. The \nbottom row shows the corresponding approximation to the curve. \n\nNotice, in this example, that we need only consider merging neighboring segments as the \nincreased error in merging non-adjoining segments would be too great. This imposes a locality \non the problem which allows for extremely efficient computation. The idea is to \nmaintain a priority queue with all potential merges on it ordered by the increase in error \ncaused by the merge. The queue contains only the neighboring pairs (of which there are n-1 if \nthere are n segments). The top pair on the queue is removed and the merge operation it represents \nis performed if it doesn't violate the stopping criteria. The other potential merge \npairs which incorporated the merged segments must be removed from the queue and the \nnew possible mergings with the generated segment must be inserted (alternatively, nothing \nneed be removed and each pair is checked for viability when it reaches the top of the \nqueue). The neighborhood structure allows each of the operations to be performed quickly \nwith the appropriate data structures and the entire merging process takes a time which is \nlinear (or linear times logarithmic) in the number of component models. Complex time-varying \ncurves may easily be processed in real time on typical workstations. In higher dimensions, \nhierarchical geometric data structures (as in Omohundro, 1987, 1990) allow a \nsimilar reduction in computation based on locality. \n\n5 BALLTREE CONSTRUCTION \nThe model merging approach is applicable to a wide variety of adaptive structures. The \n\"balltree\" structure described in (Omohundro, 
1989) provides efficient access to regions in \ngeometric spaces. It consists of a nested hierarchy of hyper-balls surrounding given leaf \nballs and efficiently supports queries which test for intersection, inclusion, or nearness to \na leaf ball. The balltree construction algorithm itself provides an example of a best-first \nmerge approach in a higher dimensional space. To determine the best hierarchy we can \nmerge the leaf balls pairwise in such a way that the total volume of all the merged regions \nis as small as possible. Figure 4 compares the quality of balltrees constructed using best-first \nmerging to those constructed using top-down and incremental algorithms. As in other \ndomains, the top-down approach has to make major decisions too early and often makes \nsuboptimal choices. The merging approach only makes global decisions after many local \ndecisions which adapt well to the structure. \n\n[Figure 4 plot: balltree error (0.00 to 1074.48) versus number of balls (0 to 500) for top-down construction, incremental construction, and best-first merge construction.] \n\nFigure 4: Balltree error as a function of number of balls for the top-down, incremental, and \nbest-first merging construction methods. Leaf balls have uniformly distributed centers in \n5 dimensions with radii uniformly distributed less than .1. \n\n6 CONCLUSION \nWe have described a simple but powerful heuristic for dynamically building models for \nboth learning and recognition which constructs complex models that adapt well to the underlying \nstructure. We presented three different examples which only begin to touch on the \npossibilities. To hint at the broad applicability, we will briefly describe several other applications \nwe are currently examining. \nIn (Omohundro, 
1991) we presented an efficient structure for modelling mappings based \non a collection of local mapping models which were combined according to a partition of \nunity formed by \"influence functions\" associated with each model. This representation is \nvery flexible and can be made computationally efficient. While in the experiments of that \npaper, the local models were affine functions (constant plus linear), they may be chosen \nfrom any desired class. The model merging approach builds such a mapping representation \nby successively merging models and replacing them with a new model whose influence \nfunction extends over the range of the two original influence functions. Because it is based \non more data, the new model can be chosen from a larger complexity class of functions than \nthe originals. \nOne of the most fundamental inductive tasks is density estimation, i.e. estimating a probability \ndistribution from samples drawn from it. A powerful standard technique is adaptive \nkernel estimation in which a normalized Gaussian (or other kernel) is placed at each sample \npoint with a width determined by the local sample density (Devroye and Gyorfi, 1985). \nModel merging can be applied to improve the generalization performance of this approach \nby choosing successively more complex component densities once enough data has accumulated \nby merging. For example, consider a density supported on a curve in a high dimensional \nspace. Initially the estimate will consist of radially-symmetric Gaussians at each \nsample point. After successive mergings, however, the one-dimensional linear structure \ncan be discovered (and the Gaussian components be chosen from the larger class of extended \nGaussians) and the generalization dramatically improved. 
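
One way to picture a merge step for Gaussian density components is moment matching, where the merged component keeps the combined weight, the weighted mean, and the covariance of the two-component mixture it replaces. Moment matching is an assumption made here purely for illustration; the text above does not specify how the merged component is fitted, and merge_gaussians is a hypothetical name.

```python
def merge_gaussians(w1, mean1, cov1, w2, mean2, cov2):
    # Moment-matched merge of two weighted Gaussian components.
    # Means are lists of floats, covariances are nested lists (d x d).
    w = w1 + w2
    a, b = w1 / w, w2 / w
    d = len(mean1)
    mean = [a * mean1[i] + b * mean2[i] for i in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # Each term is a component covariance plus the spread of its
            # mean around the merged mean.
            s1 = cov1[i][j] + (mean1[i] - mean[i]) * (mean1[j] - mean[j])
            s2 = cov2[i][j] + (mean2[i] - mean[i]) * (mean2[j] - mean[j])
            cov[i][j] = a * s1 + b * s2
    return w, mean, cov


# Two radially symmetric components whose centers lie along the x axis.
iso = [[0.25, 0.0], [0.0, 0.25]]
weight, mean, cov = merge_gaussians(1.0, [0.0, 0.0], iso, 1.0, [2.0, 0.0], iso)
print(weight, mean, cov)
```

In this sketch the merged component comes out elongated along the line joining the two original centers, which is the kind of extended Gaussian the paragraph refers to.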
\nOther natural areas of application include inducing the structure of hidden Markov models, \nstochastic context-free grammars, Markov random fields, and Bayesian networks. \n\nReferences \n\nD. H. Ballard and C. M. Brown. (1982) Computer Vision. Englewood Cliffs, NJ: Prentice-Hall. \nW. L. Buntine and A. S. Weigend. (1992) Bayesian Back-Propagation. To appear in: Complex \nSystems. \nL. Devroye and L. Gyorfi. (1985) Nonparametric Density Estimation: The L1 View, New \nYork: Wiley. \nD. J. MacKay. (1992) A Practical Bayesian Framework for Backprop Networks. Caltech \npreprint. \nJ. Moody and C. Darken. (1989) Fast learning in networks of locally-tuned processing \nunits. Neural Computation, 1, 281-294. \nS. M. Omohundro. (1987) Efficient algorithms with neural network behavior. Complex \nSystems 1:273-347. \nS. M. Omohundro. (1989) Five balltree construction algorithms. International Computer \nScience Institute Technical Report TR-89-063. \nS. M. Omohundro. (1990) Geometric learning algorithms. Physica D 42:307-321. \nS. M. Omohundro. (1991) Bumptrees for Efficient Function, Constraint, and Classification \nLearning. In Lippmann, Moody, and Touretzky. (eds.) Advances in Neural Information \nProcessing Systems 3. San Mateo, CA: Morgan Kaufmann Publishers. \nR. N. Shepard. (1987) Toward a universal law of generalization for psychological science. \nScience. \nV. Vapnik. (1982) Estimation of Dependences Based on Empirical Data, New York: \nSpringer-Verlag. \n", "award": [], "sourceid": 495, "authors": [{"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}