{"title": "Memory-based Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1066, "page_last": 1072, "abstract": null, "full_text": "Memory-based Stochastic Optimization\n\nAndrew W. Moore and Jeff Schneider\n\nSchool of Computer Science\nCarnegie-Mellon University\nPittsburgh, PA 15213\n\nAbstract\n\nIn this paper we introduce new algorithms for optimizing noisy plants in which each experiment is very expensive. The algorithms build a global non-linear model of the expected output at the same time as using Bayesian linear regression analysis of locally weighted polynomial models. The local model answers queries about confidence, noise, gradient and Hessians, and the algorithms use these answers to make automated decisions similar to those made by a practitioner of Response Surface Methodology. The global and local models are combined naturally as a locally weighted regression. We examine the question of whether the global model can really help optimization, and we extend it to the case of time-varying functions. We compare the new algorithms with a highly tuned higher-order stochastic optimization algorithm on randomly-generated functions and a simulated manufacturing task. We note significant improvements in total regret, time to converge, and final solution quality.\n\n1 INTRODUCTION\n\nIn a stochastic optimization problem, noisy samples are taken from a plant. A sample consists of a chosen control u (a vector of real numbers) and a noisy observed response y. The response y is drawn from a distribution with mean and variance that depend on u, and is assumed to be independent of previous experiments. Informally, the goal is to quickly find a control u that maximizes the expected output E[y | u]. 
This is different from conventional numerical optimization because the samples can be very noisy, there is no gradient information, and we usually wish to avoid ever performing badly (relative to our start state) even during optimization. Finally and importantly: each experiment is very expensive and there is ample computational time (often many minutes) for deciding on the next experiment. The following questions are both interesting and important: how should this computational time best be used, and how can the data best be used?\n\nStochastic optimization is of real industrial importance, and indeed one of our reasons for investigating it is an association with a U.S. manufacturing company that has many new examples of stochastic optimization problems every year.\n\nThe discrete version of this problem, in which u is chosen from a discrete set, is the well known k-armed bandit problem. Reinforcement learning researchers have recently applied bandit-like algorithms to efficiently optimize several discrete problems [Kaelbling, 1990, Greiner and Jurisica, 1992, Gratch et al., 1993, Maron and Moore, 1993]. This paper considers extensions to the continuous case in which u is a vector of reals. We anticipate useful applications here too. Continuity implies a formidable number of arms (uncountably infinite) but permits us to assume smoothness of E[y | u] as a function of u.\n\nThe most popular current techniques are:\n\n\u2022 Response Surface Methods (RSM). Current RSM practice is described in the classic reference [Box and Draper, 1987]. Optimization proceeds by cautious steepest ascent hill-climbing. 
A region of interest (ROI) is established at a starting point and experiments are made at positions within the region that can best be used to identify the function properties with low-order polynomial regression. A large portion of the RSM literature concerns experimental design: the decision of where to take data points in order to acquire the lowest variance estimate of the local polynomial coefficients in a fixed number of experiments. When the gradient is estimated with sufficient confidence, the ROI is moved accordingly. Regression of a quadratic locates optima within the ROI and also diagnoses ridge systems and saddle points.\nThe strength of RSM is that it is careful not to change operating conditions based on inadequate evidence, but moves once the data justifies it. A weakness of RSM is that human judgment is needed: it is not an algorithm, but a manufacturing methodology.\n\n\u2022 Stochastic Approximation methods. The algorithm of [Robbins and Monro, 1951] does root finding without the use of derivative estimates. Through the use of successively smaller steps, convergence is proven under broad assumptions about noise. Kiefer-Wolfowitz (KW) [Kushner and Clark, 1978] is a related algorithm for optimization problems. From an initial point it estimates the gradient by performing an experiment in each direction along each dimension of the input space. Based on the estimate, it moves its experiment center and repeats. Again, use of decreasing step sizes leads to a proof of convergence to a local optimum.\nThe strength of KW is its aggressive exploration, its simplicity, and that it comes with convergence guarantees. However, it has more of a danger of attempting wild experiments in the presence of noise, and effectively discards the data it collects after each gradient estimate is made. 
In practice, higher-order versions of KW are available in which convergence is accelerated by replacing the fixed step size schedule with an adaptive one [Kushner and Clark, 1978]. Later we compare the performance of our algorithms to such a higher-order KW.\n\n2 MEMORY-BASED OPTIMIZATION\n\nNeither KW nor RSM uses old data. After a gradient has been identified, the control u is moved up the gradient and the data that produced the gradient estimate is discarded. Does this lead to inefficiencies in operation? This paper investigates one way of using old data: build a global non-linear plant model with it.\n\nWe use locally weighted regression to model the system [Cleveland and Delvin, 1988, Atkeson, 1989, Moore, 1992]. We have adapted the methods to return posterior distributions for their coefficients and noise (and thus, indirectly, their predictions) based on very broad priors, following the Bayesian methods for global linear regression described in [DeGroot, 1970].\n\nWe estimate the coefficients $\\boldsymbol{\\beta} = \\{\\beta_1, \\ldots, \\beta_m\\}$ of a local polynomial model in which the data was generated by the polynomial and corrupted with Gaussian noise of variance $\\sigma^2$, which we also estimate. Our prior assumption will be that $\\boldsymbol{\\beta}$ is distributed according to a multivariate Gaussian of mean 0 and covariance matrix $\\Sigma$. Our prior on $\\sigma$ is that $1/\\sigma^2$ has a gamma distribution with parameters $\\alpha$ and $\\beta$.\n\nAssume we have observed n pieces of data. The jth polynomial term for the ith data point is $X_{ij}$ and the output response of the ith data point is $y_i$. Assume further that we wish to estimate the model local to the query point $x_q$, in which a data point at distance $d_i$ from the query point has weight $w_i = \\exp(-d_i^2 / K)$. 
K, the kernel width, is a fixed parameter that determines the degree of localness in the local regression. Let $W = \\mathrm{Diag}(w_1, w_2, \\ldots, w_n)$.\n\nThe marginal posterior distribution of $\\boldsymbol{\\beta}$ is a t distribution with mean $\\bar{\\boldsymbol{\\beta}} = (\\Sigma^{-1} + X^T W^2 X)^{-1} (X^T W^2 y)$, covariance\n\n$$\\frac{\\left(2\\beta + (y^T - \\bar{\\boldsymbol{\\beta}}^T X^T) W^2 y\\right) (\\Sigma^{-1} + X^T W^2 X)^{-1}}{2\\alpha + \\sum_{i=1}^{n} w_i^2} \\qquad (1)$$\n\nand $\\alpha + \\sum_{i=1}^{n} w_i^2$ degrees of freedom.\n\nWe assume a wide, weak prior $\\Sigma = \\mathrm{Diag}(20^2, 20^2, \\ldots, 20^2)$, $\\alpha = 0.8$, $\\beta = 0.001$, meaning the prior assumes each regression coefficient independently lies with high probability in the range -20 to 20, and the noise lies in the range 0.01 to 0.5.\n\nBriefly, we note the following reasons that Bayesian locally weighted polynomial regression is particularly suited to this application:\n\n\u2022 We can directly obtain meaningful confidence estimates of the joint pdf of the regressed coefficients and predictions. Indirectly, we can compute the probability distribution of the steepest gradient, the location of local optima and the principal components of the local Hessian.\n\n\u2022 The Bayesian approach allows meaningful regressions even with fewer data points than regression coefficients: the posterior distribution reveals enormous lack of confidence in some aspects of such a model but other useful aspects can still be predicted with confidence. This is crucial in high dimensions, where it may be more effective to head in a known positive gradient direction without waiting for all the experiments that would be needed for a precise estimate of steepest gradient.\n\n\u2022 Other pros and cons of locally weighted regression in the context of control can be found in [Moore et al., 1995].\n\nGiven the ability to derive a plant model from data, how should it best be used? 
The true optimal answer, which requires solving an infinite-dimensional Markov decision process, is intractable. We have developed four approximate algorithms that use the learned model, described briefly below.\n\n\u2022 AutoRSM. Fully automates the (normally manual) RSM procedure and incorporates weighted data from the model, not only from the current design. It uses online experimental design to pick ROI design points to maximize information about local gradients and optima. Space does not permit description of the linear algebraic formulations of these questions.\n\n\u2022 PMAX. This is a greedy, simpler approach that uses the global non-linear model from the data to jump immediately to the model optimum. This is similar to the technique described in [Botros, 1994], with two extensions. First, the Bayesian priors enable useful decisions before the regression becomes full-rank. Second, local quadratic models permit second-order convergence near an optimum.\n\nFigure 1: Three examples of 2-d functions used in optimization experiments.\n\n\u2022 IEMAX. Applies Kaelbling's IE algorithm [Kaelbling, 1990] in the continuous case using Bayesian confidence intervals.\n\n$$u_{\\text{chosen}} = \\arg\\max_{u} \\hat{f}_{\\text{opt}}(u) \\qquad (2)$$\n\nwhere $\\hat{f}_{\\text{opt}}(u)$ is the top of the 95% confidence interval. The intuition here is that we are encouraged to explore more aggressively than PMAX, but will not explore areas that are confidently below the best known optimum.\n\n\u2022 COMAX. In a real plant we would never want to apply PMAX or IEMAX. Experiments must be cautious for reasons of safety, quality control, and managerial peace of mind. 
COMAX extends IEMAX thus:\n\n$$u_{\\text{chosen}} = \\arg\\max_{u \\in \\text{SAFE}} \\hat{f}_{\\text{opt}}(u); \\qquad u \\in \\text{SAFE} \\iff \\hat{f}_{\\text{pess}}(u) > \\text{disaster threshold} \\qquad (3)$$\n\nAnalysis of these algorithms is problematic unless we are prepared to make strong assumptions about the form of E[y | u]. To examine the general case we rely on Monte Carlo simulations, which we now describe.\n\nThe experiments used randomly generated nonlinear unimodal (but not necessarily convex) d-dimensional functions $[0,1]^d \\to [0,1]$. Figure 1 shows three example 2-d functions. Gaussian noise ($\\sigma = 0.1$) is added to the functions. This is large noise, and means several function evaluations would be needed to achieve a reliable gradient estimate for a system using even a large step size such as 0.2.\n\nThe following optimization algorithms were tested on a sample of such functions.\n\nVary-KW: The best performing KW algorithm we could find. It varied step size and adapted gradient estimation steps to avoid undue regret at optima.\nFixed-KW: A version of KW that keeps its gradient-detecting step size fixed. This risks causing extra regret at a true optimum, but has less chance of becoming delayed by a non-optimum.\nAuto-RSM: The best performing version thereof.\nPassive-RSM: Auto-RSM continues to identify the precise location of the optimum when it has arrived at that optimum. When Passive-RSM is confident (greater than 99%) that it knows the location of the optimum to two significant places, it stops experimenting.\nLinear RSM: A linear instead of quadratic model, thus restricted to steepest ascent.\nCRSM: Auto-RSM with conservative parameters, more typical of those recommended in the RSM literature.\nAs described above: Pmax, IEmax 
and Comax.\n\nFigures 2a and 2b show the first sixty experiments taken by AutoRSM and KW respectively on their journeys to the goal.\n\nFigure 2a: The path taken (start at (0.8,0.2)) by AutoRSM optimizing the given function with added noise of standard deviation 0.1 at each experiment.\n\nFigure 2b: The path taken (start at (0.8,0.2)) by KW. KW's path looks deceptively bad, but remember it is continually buffeted by considerable noise.\n\nFigure 3: Comparing nine stochastic optimization algorithms by four criteria: (a) Regret, (b) Disasters, (c) Speed to converge, (d) Quality at convergence. The partial order depicted shows which results are significant at the 99% level (using blocked pairwise comparisons). The outputs of the random functions range between 0 and 1 over the input domain. The numbers in the boxes are means over fifty 5-d functions. (a) Regret is defined as the mean of $y_{\\text{opt}} - y_i$: the cost incurred during the optimization compared with performance if we had known the optimum location and used it from the beginning. With the exception of IEMAX, model-based methods perform significantly better than KW, with reduced advantage for cautious and linear methods. (b) The percentage of steps which tried experiments with more than 0.1 units worse performance than at the search start. This matters to a risk averse manager. AutoRSM has fewer than 1% disasters, but COMAX and the model-free methods do better still. PMAX's aggressive exploration costs it. (c) The number of steps until we reach within 0.05 units of optimal. PMAX's aggressiveness wins. 
(d) The quality of the \"final\" solution between steps 50 and 60 of the optimization.\n\nResults for 50 trials of each optimization algorithm for five-dimensional randomly generated functions are depicted in Figure 3. Many other experiments were performed in other dimensionalities and for modified versions of the algorithm, but space does not permit detailed discussion here.\n\nFinally we performed experiments with the simulated power-plant process in Figure 4. The catalyst controller adjusts the flow rate of the catalyst to achieve the goal chemical A content. Its actions also affect chemical B content. The temperature controller adjusts the reaction chamber temperature to achieve the goal chemical B content. The chemical contents are also affected by the flow rate, which is determined externally by demand for the product.\n\nThe task is to find the optimal values for the six controller parameters that minimize the total squared deviation from desired values of chemical A and chemical B contents. The feedback loops from sensors to controllers have significant delay. The controller gains on product demand are feedforward terms since there is significant delay in the effects of demand on the process. Finally, the performance of the system may also depend on variations over time in the composition of the input chemicals, which cannot be directly sensed. 
\n\nFigure 4: A simulated chemical process. [Diagram: raw input chemicals and a catalyst supply feed a reaction chamber through pumps governed by demand for the product; chemical A and chemical B content sensors monitor the product output. The six optimized controller parameters are the temperature controller's base temperature, sensor B gain and product demand gain, and the catalyst controller's base input rate, sensor A gain and product demand gain. The objective is to minimize squared deviation from the goal chemical A and B content.]\n\nThe total summed regrets of the optimization methods on 200 simulated steps were:\n\nStayAtStart  FixedKW  AutoRSM  PMAX  COMAX\n10.86        2.82     1.32     3.30  4.50\n\nIn this case AutoRSM is best, considerably beating the best KW algorithm we could find. In contrast PMAX and COMAX did poorly: in this plant wild experiments are very costly to PMAX and COMAX is too cautious. StayAtStart is the regret that would be incurred if all 200 steps were taken at the initial parameter setting.\n\n3 UNOBSERVED DISTURBANCES\n\nAn apparent danger of learning a model is that if the environment changes, the out-of-date model will mean poor performance and very slow adaptation. The model-free methods, which use only recent data, will react more nimbly. A simple but unsatisfactory answer to this is to use a model that implicitly (e.g. a neural net) or explicitly (e.g. local weighted regression of the fifty most recent points) forgets. An interesting possibility is to learn a model in a way that automatically determines whether a disturbance has occurred, and if so, how far back to forget. 
\n\nThe following \"adaptive forgetting\" (AF) algorithm was added to the AutoRSM algorithm: At each step, use all the previous data to generate 99% confidence intervals on the output value at the current step. If the observed output is outside the intervals, assume that a large change in the system has occurred and forget all previous data. This algorithm is good for recognizing jumps in the plant's operating characteristics and allows AutoRSM to respond to them quickly, but is not suitable for detecting and handling process drift.\n\nWe tested our algorithm's performance on the simulated plant for 450 steps. Operation began as before, but at step 150 there was an unobserved change in the composition of the raw input chemicals. The total regrets of the optimization methods were:\n\nStayAtStart  FixedKW  AutoRSM  PMAX  AutoRSM/AF\n11.90        5.31     8.37     9.23  2.75\n\nAutoRSM and PMAX do poorly because all their decisions after step 150 are based partially on the invalid data collected before then. The AF addition to AutoRSM solves the problem while beating the best KW by a factor of 2. Furthermore, AutoRSM/AF gets 1.76 on the invariant task, thus demonstrating that it can be used safely in cases where it is not known if the process is time varying.\n\n4 DISCUSSION\n\nBotros' thesis [Botros, 1994] discusses an algorithm similar to PMAX based on local linear regression. [Salganicoff and Ungar, 1995] uses a decision tree to learn a model. They use Gittins indices to suggest experiments: we believe that the memory-based methods can benefit from them too. They, however, do not use gradient information, and so require many experiments to search a 2D space. 
\n\nIEmax performed badly in these experiments, but optimism-guided exploration may prove important in algorithms which check for potentially superior local optima.\n\nA possible extension is self-tuning optimization. Part way through an optimization, to estimate the best optimization parameters for an algorithm we can run Monte Carlo simulations on sample functions from the posterior global model given the current data.\n\nThis paper has examined the question of how much learning a Bayesian memory-based model can accelerate the convergence of stochastic optimization. We have proposed four algorithms for doing this, one based on an autonomous version of RSM; the other three upon greedily jumping to optima of three criteria dependent on predicted output and uncertainty. Empirically the model-based methods provide significant gains over a highly tuned higher-order model-free method.\n\nReferences\n\n[Atkeson, 1989] C. G. Atkeson. Using Local Models to Control Movement. In Proceedings of Neural Information Processing Systems Conference, November 1989.\n\n[Botros, 1994] S. M. Botros. Model-Based Techniques in Motor Learning and Task Optimization. PhD Thesis, MIT Dept. of Brain and Cognitive Sciences, February 1994.\n\n[Box and Draper, 1987] G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. Wiley, 1987.\n\n[Cleveland and Delvin, 1988] W. S. Cleveland and S. J. Delvin. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association, 83(403):596-610, September 1988.\n\n[DeGroot, 1970] M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.\n\n[Gratch et al., 1993] J. Gratch, S. Chien, and G. DeJong. 
Learning Search Control Knowledge for Deep Space Network Scheduling. In Proceedings of the 10th International Conference on Machine Learning. Morgan Kaufmann, June 1993.\n\n[Greiner and Jurisica, 1992] R. Greiner and I. Jurisica. A statistical approach to solving the EBL utility problem. In Proceedings of the Tenth International Conference on Artificial Intelligence (AAAI-92). MIT Press, 1992.\n\n[Kaelbling, 1990] L. P. Kaelbling. Learning in Embedded Systems. PhD Thesis; Technical Report No. TR-90-04, Stanford University, Department of Computer Science, June 1990.\n\n[Kushner and Clark, 1978] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.\n\n[Maron and Moore, 1993] O. Maron and A. Moore. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann, December 1993.\n\n[Moore et al., 1995] A. W. Moore, C. G. Atkeson, and S. Schaal. Memory-based Learning for Control. Technical report, CMU Robotics Institute, Technical Report CMU-RI-TR-95-18 (Submitted for Publication), 1995.\n\n[Moore, 1992] A. W. Moore. Fast, Robust Adaptive Control by Learning only Forward Models. In J. E. Moody, S. J. Hanson, and R. P. Lippman, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, April 1992.\n\n[Robbins and Monro, 1951] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.\n\n[Salganicoff and Ungar, 1995] M. Salganicoff and L. H. Ungar. Active Exploration and Learning in Real-Valued Spaces using Multi-Armed Bandit Allocation Indices. 
In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 1995.\n", "award": [], "sourceid": 1124, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}, {"given_name": "Jeff", "family_name": "Schneider", "institution": null}]}