{"title": "Acquisition in Autoshaping", "book": "Advances in Neural Information Processing Systems", "page_first": 24, "page_last": 30, "abstract": null, "full_text": "Acquisition in Autoshaping \n\nSham Kakade \n\nPeter Dayan \n\nGatsby Computational Neuroscience Unit \n\n17 Queen Square, London, England, WC1N 3AR. \n\nsham@gatsby.ucl.ac.uk \n\ndayan@gatsby.ucl.ac.uk \n\nAbstract \n\nQuantitative data on the speed with which animals acquire behavioral \nresponses during classical conditioning experiments should \nprovide strong constraints on models of learning. However, most \nmodels have simply ignored these data; the few that have attempted \nto address them have failed by at least an order of magnitude. \nWe discuss key data on the speed of acquisition, and show how to \naccount for them using a statistically sound model of learning, in \nwhich differential reliabilities of stimuli play a crucial role. \n\n1  Introduction \n\nConditioning experiments probe the ways that animals make predictions about \nrewards and punishments and how those predictions are used to their advantage. \nSubstantial quantitative data are available as to how pigeons and rats acquire conditioned \nresponses during autoshaping, which is one of the simplest paradigms \nof classical conditioning [4]. These data are revealing about the statistical, and ultimately \nalso the neural, substrate underlying the ways that animals learn about the \ncausal texture of their environments. \nIn autoshaping experiments on pigeons, the birds acquire a peck response to a \nlighted key associated (irrespective of their actions) with the delivery of food. 
One \nattractive feature of autoshaping is that there is no need for separate 'probe trials' \nto assess the degree of association formed between the light and the food by the \nanimal; rather, the rate of key pecking during the light (and before the food) can \nbe used as a direct measure of this association. In particular, acquisition speeds are \noften measured by the number of trials until a certain behavioral criterion is met, \nsuch as pecking during the light on three out of four successive trials [4,8,10]. \nAs stressed persuasively by Gallistel & Gibbon [4] (GG; forthcoming), the critical \nfeature of autoshaping is that there is substantial experimental evidence on how \nacquisition speed depends on the three critical variables shown in figure 1A. The \nfirst is I, the inter-trial interval; the second is T, the time during the trial for which \nthe light is presented; the third is the training schedule, 1/S, which is the fractional \nnumber of deliveries per light (some birds were only partially reinforced). \n\nFigure 1: Autoshaping. A) Experimental paradigm. Top: the light is presented for T seconds every C \nseconds and is always followed by the delivery of food (filled circle). Bottom: the food is delivered with \nprobability 1/S = 1/2 per trial. In some cases I is stochastic, with the appropriate mean. B) Log-log \nplot [4] of the number of reinforcements to a given acquisition criterion versus the I/T ratio for S = 1. \nThe data are median acquisition times from 12 different laboratories. C) Log-log acquisition curves for \nvarious I/T ratios and S values. The main graph shows trials versus S; the inset shows reinforcements \nversus S. (1999). \n\nFigure 1 makes three key points. First, figure 1B shows that the median number of \ntrials to the acquisition criterion depends on the ratio I/T, and not on I and T \nseparately: experiments reported for the same I/T are actually performed with I \nand T differing by more than an order of magnitude [4,8]. Second, figure 1B shows \nconvincingly that the number of reinforcements is approximately inversely proportional \nto I/T: the relatively shorter the presentation of the light, the faster the learning. \nThird, figure 1C shows that partial reinforcement has almost no effect when \nmeasured as a function of the number of reinforcements (rather than the number \nof trials) [4,10], since although it takes S times as many trials to acquire, there are \nreinforcements on only a fraction 1/S of the trials. Changing S does not change the effective I/T when \nmeasured as a function of reinforcements, so this result might actually be expected \non the basis of figure 1B, and we only consider S = 1 in this paper. Altogether, the \ndata show that: \n\nn ≈ 300 × T/I    (1) \n\nwhere n is the number of rewards to the acquisition criterion. Remarkably, these \neffects seem to hold for over an order of magnitude in both I/T and S. \nThese quantitative data should be a most seductive target for statistically sound \nmodels of learning. However, few models have even attempted to capture the \nstrong constraints they provide, and those that have all fail in critical \nrespects. The best of them, rate estimation theory [4] (RET), is closely related to the \nRescorla-Wagner [13] (RW) model, and actually captures the proportionality in equation 1. \nHowever, as shown below, RET grossly overestimates the observed speed \nof acquisition (underestimating the proportionality constant). 
Further, RET is designed \nto account for the time at which a particular, standard, acquisition criterion \nis met. Figure 2A shows that this is revealing only about the very early stages of \nlearning; RET is silent about the remainder of the learning curve. \nWe look at additional quantitative data on learning, which collectively suggest \nthat stimuli compete to predict the delivery of reward. Dayan & Long [3] (DL) discussed \nvarious statistically inspired competitive models of classical conditioning, \nconcluding with one in which stimuli are differently reliable as predictors of reward. \nHowever, DL ignored the data shown in figures 1 and 2, basing their analysis \non conditioning paradigms in which I/T was not a factor. Figures 1 and 2 \ndemand a more sophisticated statistical model; building such a model is the \nfocus of this paper. \n\n2  Rate Estimation Theory \n\nGallistel & Gibbon [4] (GG; forthcoming) are amongst the strongest proponents of the \nquantitative relationships in figure 1. To account for them, GG suggest that animals \nare estimating the rates of rewards: one, λ_l, for the rate associated with the light \nand another, λ_b, for the rate associated with the background context. The context is \nthe ever-present environment which can itself gain associative value. The overall \npredicted reward rate while the light is on is λ_l + λ_b, and the rate without the light \nis just λ_b. \n\nFigure 2: Additional Autoshaping Data. A) Acquisition of keypecking. The figure shows response \nrate versus reinforcements [6]. The acquisition criterion is satisfied at a relatively early time when the \nresponse curve crosses the acquisition criterion line. B) The effects of prior context reinforcements on \nsubsequent acquisition speed. The data are taken from two experiments [1,2] with I/T = 6. \n\nThe additive form of the model makes it similar to Rescorla-Wagner's [13] (RW) standard \ndelta-rule model, for which the net prediction of the expected reward in a \ntrial is the sum of the associative values of each active predictor (in this case, the \ncontext and light). If the rewards are modeled as being just present or absent, the \nexpected value for a reward is just its probability of occurrence. Instead, RET uses \nrates, which are just probabilities per unit time. \nGG [4] formulated their model from a frequentist viewpoint. However, it is easier to \ndiscuss a closely related Bayesian model which suffers from the same underlying \nproblem. Instead of using RW's delta-rule for learning the rates, GG assume that \nreinforcements come from a constant rate Poisson process, and make sound statistical \ninferences about the rates given the data on the rewards. Using an improper \nflat prior over the rates, we can write the joint distribution as: \n\nP(λ_l, λ_b | data) ∝ P(n | λ_l, λ_b, t_l, t_b) ∝ (λ_l + λ_b)^n e^{-(λ_l + λ_b) t_l} e^{-λ_b t_b}    (2) \n\nsince all n rewards occur with the light, at rate λ_l + λ_b. Here, t_l = nT is the total \ntime the light is on, and t_b = nI is the total time the light is off. \nGG take the further important step of relating the inferred rates λ_l and λ_b to the \ndecision of the animals to start responding (i.e. to satisfy the acquisition criterion). \nGG suggest that acquisition should occur when the animals have strong evidence \nthat the fractional increase in the reward rate, whilst the light is on, is greater than \nsome threshold. 
More formally, acquisition should occur when: \n\nP((λ_l + λ_b)/λ_b > β | n) = 1 - α    (3) \n\nwhere α is the uncertainty threshold and β is slightly greater than one, reflecting \nthe fractional increase. The n that first satisfies equation 3 can be found by \nintegrating the joint probability in equation 2. It turns out that n ∝ t_l/t_b, which \ngives the approximately inverse proportionality to the ratio I/T seen in figure 1B, since \nt_l/t_b = nT/nI = T/I. It also has no dependence on partial reinforcement, as \nobserved in figure 1C. \nHowever, even with a very low uncertainty, α = 0.001, and a reasonable fractional \nincrease, β = 1.5, this model predicts that learning should be more than ten times \nas fast as observed, since we get n ≈ 20 × T/I as opposed to the 300 × T/I observed. \nEquation 1 can only be satisfied by setting α between 10^-20 and 10^-50 (depending \non the precise values of I/T and β)! This spells problems for GG as a normative, \nideal detector model of learning: it cannot, for instance, be repaired with any \nreasonable prior for the rates, as α drops drastically with n. In other circumstances, \nthough, Gallistel, Mark & King [5] (forthcoming) have shown that animals can be \nideal detectors of changes in rates. \nOne hint of the flaw with GG is that simple manipulations to the context before \nstarting autoshaping (in particular extinction) can produce very rapid learning [2]. \nMore generally, the data show that acquisition speed is strongly controlled by prior \nrewards being given only in the context (without the light present) [2]. Figure 2B \nshows a parametric study of subsequent acquisition speeds during autoshaping as \na function of the number of rewards given only with the context. 
This effect cannot \nsimply be modeled by assuming a different prior distribution for the rates (which \ndoes not fix the problem of the speed of acquisition in any case), since the rate at \nwhich these prior context rewards were given has little effect on subsequent acquisition \nspeed for a given number of prior reinforcements [9]. Note that the data in \nfigure 2B (i.e. equation 1) suggest that there were about thirty prior rewards in the \ncontext; this is consistent with the experimental procedures used [8-10], although \nprior experience was not a carefully controlled factor. \n\n3  The Competitive Model \n\nFive sets of constraints govern our new model. First, since animals can be ideal \ndetectors of rates in some circumstances [5], we only consider accounts under which \ntheir acquisition of responding has a rational statistical basis. Second, the number \nof reinforcements to acquisition must be n ≈ 300 × T/I, as in equation 1. This requires \nthat the constant of proportionality should come from rational, not absurd, \nuncertainties. Third, pecking rates after the acquisition criterion is satisfied should \nalso follow the form of figure 2A (in the end, we are prevented from giving a normative \naccount of this by a dearth of data). Fourth, the overall learning speed should be \nstrongly affected by the number of prior context rewards (figure 2B), but not by the \nrate at which they were presented. That is, the context, as an established predictor, \nregardless of the rate it predicts, should be able to substantially block learning \nto a less established predictor. Finally, the asymptotic accuracy of rate estimates \nshould satisfy the substantial experimental data on the intrinsic uncertainty in the \npredictions in the form of a quantitative account called scalar expectancy theory [7] \n(SET). 
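As a numerical aside on the second constraint, the following is a minimal sketch of the flat-prior criterion in equations 2 and 3. It is not GG's own procedure. It rests on one assumption of ours: that λ_l + λ_b and λ_b can be treated as Gamma(n + 1, t_l) and Exponential(t_b) variables truncated to λ_l >= 0, which gives a closed form in the ratio I/T. The function names and the default values are illustrative only.

```python
import math


def ret_uncertainty(n, ratio_i_over_t, beta=1.5):
    # Posterior probability that (lambda_l + lambda_b) / lambda_b <= beta
    # after n rewards, under the flat-prior posterior of equation 2.
    # Assumption (ours): u = lambda_l + lambda_b and v = lambda_b behave as
    # Gamma(n + 1, t_l) and Exponential(t_b) variables truncated to u >= v,
    # with t_b / t_l = I / T. Written as (a - b) / (1 - b) to stay accurate
    # when the uncertainty is far below floating-point epsilon.
    a = (1.0 + ratio_i_over_t / beta) ** -(n + 1)
    b = (1.0 + ratio_i_over_t) ** -(n + 1)
    return (a - b) / (1.0 - b)


def ret_rewards_to_criterion(ratio_i_over_t, alpha=0.001, beta=1.5, max_n=10000):
    # Smallest n whose uncertainty is at or below alpha (equation 3),
    # or None if the criterion is not met within max_n rewards.
    for n in range(1, max_n + 1):
        if ret_uncertainty(n, ratio_i_over_t, beta) <= alpha:
            return n
    return None


if __name__ == '__main__':
    ratio = 6.0                      # I/T used in figure 2B
    n_pred = ret_rewards_to_criterion(ratio)
    n_obs = round(300 / ratio)       # equation 1
    print('predicted n:', n_pred)
    print('observed n:', n_obs)
    print('uncertainty implied at observed n:', ret_uncertainty(n_obs, ratio))
```

Under these assumptions, with I/T = 6 the criterion is met after only a handful of rewards, while the uncertainty implied at the observed n = 50 is of order 10^-36. That is consistent with the mismatch described in section 2.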
\nIn our model, as in DL, an independent prediction of the rate of reward delivery is \nmade on the basis of each stimulus that is present (w_c for the context; w_l for the \nlight). These separate predictions are combined based on estimated reliabilities of \nthe predictions. Here, we present a heuristic version of a more rigorously specified \nmodel [12]. \n\n3.1  Rate Predictions \n\nSET [7] was originally developed to capture the nature of uncertainty in the way \nthat animals estimate time intervals. Its most important result is that the standard \ndeviation of an estimate is consistently proportional to the mean, even after an \nasymptotic number of presentations of the interval. Since the estimated time to a \nreward is just the inverse rate, asymptotic rate estimates might also be expected \nto have constant coefficients of variation. Therefore, we constrain the standard \ndeviations of rate estimates not to drop below a multiple of their means. Evidence \nsuggests that this multiple is about 0.2 [7]. RET clearly does not satisfy this constraint \nas the joint distribution (equation 2) becomes arbitrarily accurate over time. \nInspired by Sutton [14], we consider Kalman filter models for independent log-predictions, \nlog w_c(m) and log w_l(m), on trial m. The output models for the filters \nspecify the relationship between the predicted and observed rates. We use a simple \nlog-normal, LN, approximation (to an underlying truly Poisson model): \n\nP(o_c(m) | w_c(m)) ~ LN(w_c(m), ν_c^2)    P(o_l(m) | w_l(m)) ~ LN(w_l(m), ν_l^2)    (4) \n\nwhere o_*(m) is the observed average reward whilst predictor * is present, so if a \nreward occurs with the light in trial m, then o_l(m) = 1/T and o_c(m) = 1/C (where \nC = T + I). 
The values of ν_*^2 can be determined, from the Poisson model, to be \nν_c^2 = ν_l^2 = 1. \nThe other part of the Kalman filter is a model of change in the world for the w's: \n\nlog w_c(m) = log w_c(m - 1) + ε_c(m),    ε_c(m) ~ N(0, (η(η + 1))^{-1})    (5) \nlog w_l(m) = log w_l(m - 1) + ε_l(m),    ε_l(m) ~ N(0, (η(η + 1))^{-1})    (6) \n\nWe use log(rates) so that there is no inherent scale to change in the world. Here, \nη is a constant chosen to satisfy the SET constraint, imposed as σ_* = w_*/√η at \nasymptote. Notice that η acts as the effective number of rewards remembered, \nwhich will be less than 30, to get the observed coefficient of variation above 0.2. \nAfter observing the data from m trials, the posterior distributions for the predictions \nwill become approximately: \n\nP(w_c(m) | data) ~ N(1/C, σ_c^2(m))    P(w_l(m) | data) ~ N(1/T, σ_l^2(m))    (7) \n\nand, in about m = η trials, σ_c(m) → (1/C)/√η and σ_l(m) → (1/T)/√η. This \ncaptures the fastest acquisition in figure 2, and also extinction. \n\n3.2  Cooperative Mixture of Experts \n\nThe two predictions (equation 7) are combined using the factorial experts model of \nJacobs et al. [11] that was also used by DL. For this, during the presentation of the light \n(and the context, of course), we consider that, independently, the relationships \nbetween the actual reward rate r(m) and the outputs w_l(m) and w_c(m) of 'experts' \nassociated with each stimulus are: \n\nP(w_l(m) | r(m)) ~ N(r(m), ρ_l(m)^{-1}),    P(w_c(m) | r(m)) ~ N(r(m), ρ_c(m)^{-1})    (8) \n\nwhere ρ_l(m) and ρ_c(m) are inverse variances, or reliabilities, for the stimuli. \nThese reliabilities reflect the belief as to how close w_l(m) and w_c(m) are to r(m). 
\nThe estimates are combined, giving \n\nP(r(m) | w_l(m), w_c(m)) ~ N(r̄(m), (ρ_l(m) + ρ_c(m))^{-1}) \n\nr̄(m) = π_l(m) w_l(m) + (1 - π_l(m)) w_c(m),    π_l(m) = ρ_l(m)/(ρ_l(m) + ρ_c(m)) \n\nThe prediction of the reward rate without the light, r_c(m), is determined just by the \ncontext value w_c(m). \nIn this formulation, the context can block the light's prediction if it is more reliable \n(ρ_c ≫ ρ_l), since π_l ≈ 0, making the mean r̄(m) ≈ w_c(m), and this blocking occurs \nregardless of the context's rate, w_c(m). If ρ_l slowly increases, then r̄(m) → w_l slowly \nas π_l(m) → 1. We expect this to model the post-acquisition part of the learning \nshown in figure 2A. \nA fully normative model of acquisition would come from a statistically correct account \nof how the reliabilities should change over time, which, in turn, would come \nfrom a statistical model of the expectations the animal has of how predictabilities \nchange in the world. Unfortunately, the slow phase of learning in figure 2A, which \nshould provide the most useful data on these expectations, is almost ubiquitously \nignored in experiments. We therefore make two assumptions about this, which are \nchosen to fit the acquisition data, but whose normative underpinnings are unclear. \n\nFigure 3: Satisfaction of the Constraints. A) The fit to the behavioral response curve (figure 2A), using \nequation 9 and π_0 = 0.004. B) Possible acquisition curves showing r(m) versus m. The ↔ on the \ncriterion line denotes the range of 15 to 120 reinforcements that are indicated by figure 2B. The -- \ncurve is the same as in figure 3A. The parameters displayed are values for π_0 in multiples of π_0 for the \ncenter curve. C) A theoretical fit to the data using equation 11. Here, α = 5% and π_0 √ρ_0 = 0.004. \n\nThe first assumption, chosen to obtain the slow learning curve, is that: \n\nπ_l(m) = tanh(π_0 m)    (9) \n\nAssuming that the strength of the behavioral response is approximately proportional \nto r(m) - r_c(m), which we will estimate by π_l(m)(w̄_l(m) - w̄_c(m)), figure 3A \ncompares the rate of key pecking in the model with the data from figure 2A. Figure 3B \nshows the effect on the behavioral response of varying π_0. Within just half \nan order of magnitude of variation of π_0, the acquisition speeds (judged at the criterion \nline shown) due to between 1200 and 0 prior context rewards (figure 2B) can be \nobtained. Note the slightly counter-intuitive explanation: the actual reward rate \nassociated with the light is established very quickly; slow learning comes from \nslow changes in the importance paid to these rates. \nWe make a second assumption that the coefficient of variation of the context's prediction, \nfrom equation 8, does not change significantly for the early trials before \nthe acquisition criterion is met (it could change thereafter). This gives: \n\nρ_c(m) ≈ ρ_0/w_c(m)^2  for early m    (10) \n\nIt is plausible that the context is not becoming a relatively worse 'expert' for early \nm, since no other predictor has yet proven more reliable. \nFollowing GG's suggestion, we model acquisition as occurring on trial m if \nP(r(m) > r_c(m) | data) ≥ 1 - α, i.e. if the animal has sound reasons to expect a \nhigher reward rate with the light. Integrating over the Kalman filter distributions \nin equation 7 gives the distribution of r(m) - r_c(m) for early m as \n\nP(r(m) - r_c(m) | data) ~ N((tanh π_0 m)(1/T - 1/C), (ρ_0 C^2)^{-1}) \n\nwhere σ_*(m) has dropped out due to π_l(m) being small at early m. 
Finding the \nnumber of rewards, n, that satisfies the acquisition criterion gives: \n\nn ≈ (α̃ / (π_0 √ρ_0)) × T/I    (11) \n\nwhere the factor α̃ depends on the uncertainty, α, used. Figure 3C shows the \ntheoretical fit to the data. \n\n4  Discussion \n\nAlthough a noble attempt, RET fails to satisfy the strong body of constraints under \nwhich any acquisition model must labor. Under RET, the acquisition of responding \ncannot have a rational statistical basis, as the animal's modeled uncertainty in \nthe association between light and reward at the time of acquisition is below 10^-20. \nFurther, RET ignores constraints set forth by the data establishing SET and also \ndata on prior context manipulations. These latter data show that the context, regardless \nof the rate it predicts, will substantially block learning to a less established \npredictor. Additive models, such as RET, are unable to capture this effect. \nWe have suggested a model in which each stimulus is like an 'expert' that learns \nindependently about the world. Expert predictions can adapt quickly to changes \nin contingencies, as they are based on a Kalman filter model, with variances chosen \nto satisfy the constraint suggested by SET, and they can be combined based on their \nreliabilities. We have demonstrated the model's close fit to substantial experimental \ndata. In particular, the new model captures the I/T dependence of the number \nof rewards to acquisition, with a constant of proportionality that reflects rational \nstatistical beliefs. The slow learning that occurs in some circumstances is due to \na slow change in the reliabilities of predictors, not due to the rates being unable \nto adapt quickly. Although we have not shown it here, the model is also able \nto account for quantitative data as to the speed of extinction of the association \nbetween the light and the reward. 
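To make the I/T dependence of the competitive model concrete, here is a minimal sketch of the acquisition criterion applied to the early-trial Gaussian for r(m) - r_c(m) given before equation 11. It is an illustration under stated assumptions, not the fitting procedure behind figure 3C. We assume ρ_0 = 1, so that π_0 √ρ_0 = 0.004 as in the caption of figure 3C. The function name and the trial cap are ours. It needs Python 3.8 or later for `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def acquisition_trial(ratio_i_over_t, pi0=0.004, rho0=1.0, alpha=0.05,
                      max_trials=5000):
    # First trial m at which P(r(m) > r_c(m) | data) >= 1 - alpha, using the
    # early-trial Gaussian: mean tanh(pi0 * m) * (1/T - 1/C) and
    # variance 1 / (rho0 * C**2), with C = T + I.
    # Assumption (ours): rho0 = 1, so pi0 * sqrt(rho0) = 0.004.
    standard_normal = NormalDist()
    for m in range(1, max_trials + 1):
        # mean / standard deviation reduces to tanh(pi0 * m) * sqrt(rho0) * I / T
        z = math.tanh(pi0 * m) * math.sqrt(rho0) * ratio_i_over_t
        if standard_normal.cdf(z) >= 1.0 - alpha:
            return m
    return None  # criterion never met within max_trials


if __name__ == '__main__':
    for ratio in (6.0, 20.0, 50.0):
        m = acquisition_trial(ratio)
        print('I/T =', ratio, 'trials =', m, 'trials * I/T =', m * ratio)
```

In this sketch the product of trials and I/T stays roughly constant across the larger ratios, which is the proportionality in equation 11. On this reading, the factor in the numerator of equation 11 behaves like the standard normal quantile at 1 - α when π_0 m is small. For small I/T the tanh saturates and the criterion may never be met, so the linear form should only be trusted in the early-trial regime.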
\nThe model leaves many directions for future study. First, we have not \nspecified a sound statistical basis for the changes in reliabilities given in equations \n9 and 10. Such a basis is key to understanding the slow phase of learning. Second, \nwe have not addressed data from more sophisticated conditioning paradigms. For \ninstance, overshadowing, in which multiple conditioned stimuli are similarly predictive \nof the reward, should be able to be incorporated into the model in a natural \nway. \n\nAcknowledgements \nWe are most grateful to Randy Gallistel and John Gibbon for freely sharing, prior \nto publication, their many ideas about timing and conditioning. We thank Sam \nRoweis for comments on an earlier version of the manuscript. Funding is from a \nNSF Graduate Research Fellowship (SK) and the Gatsby Charitable Foundation. \n\nReferences \n\n[1]  Balsam, PD, & Gibbon, J (1988). Journal of Experimental Psychology: Animal Behavior Processes, 14:401-412. \n\n[2]  Balsam, PD, & Schwartz, AL (1981). Journal of Experimental Psychology: Animal Behavior Processes, 7:382-393. \n\n[3]  Dayan, P, & Long, T (1997). Neural Information Processing Systems, 10:117-124. \n[4]  Gallistel, CR, & Gibbon, J (1999). Time, Rate, and Conditioning. Forthcoming. \n[5]  Gallistel, CR, Mark, TS & King, A (1999). Is the Rat an Ideal Detector of Changes in Rates of Reward? Forthcoming. \n\n[6]  Gamzu, ER, & Williams, DR (1973). Journal of the Experimental Analysis of Behavior, 19:225-232. \n[7]  Gibbon, J (1977). Psychological Review, 84:279-325. \n[8]  Gibbon, J, Baldock, MD, Locurto, C, Gold, L & Terrace, HS (1977). Journal of Experimental Psychology: Animal Behavior Processes, 3:264-284. \n\n[9]  Gibbon, J & Balsam, P (1981). In CM Locurto, HS Terrace, & J Gibbon, editors, Autoshaping and Conditioning Theory, 219-253. New York, NY: Academic Press. 
\n\n[10]  Gibbon, J, Farrell, L, Locurto, CM, Duncan, JH & Terrace, HS (1980). Animal Learning and Behavior, 8:45-59. \n\n[11]  Jacobs, RA, Jordan, MI, & Barto, AG (1991). Cognitive Science, 15:219-250. \n[12]  Kakade, S & Dayan, P (2000). In preparation. \n[13]  Rescorla, RA & Wagner, AR (1972). In AH Black & WF Prokasy, editors, Classical Conditioning II: Current Research and Theory, 64-69. New York, NY: Appleton-Century-Crofts. \n\n[14]  Sutton, R (1992). In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems. \n", "award": [], "sourceid": 1777, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}