{"title": "Predictive Q-Routing: A Memory-based Reinforcement Learning Approach to Adaptive Traffic Control", "book": "Advances in Neural Information Processing Systems", "page_first": 945, "page_last": 951, "abstract": null, "full_text": "Predictive Q-Routing: A Memory-based Reinforcement Learning Approach to Adaptive Traffic Control\n\nSamuel P.M. Choi, Dit-Yan Yeung\n\nDepartment of Computer Science\n\nHong Kong University of Science and Technology\n\nClear Water Bay, Kowloon, Hong Kong\n\n{pmchoi,dyyeung}@cs.ust.hk\n\nAbstract\n\nIn this paper, we propose a memory-based Q-learning algorithm called predictive Q-routing (PQ-routing) for adaptive traffic control. We attempt to address two problems encountered in Q-routing (Boyan & Littman, 1994), namely, the inability to fine-tune routing policies under low network load and the inability to learn new optimal policies under decreasing load conditions. Unlike other memory-based reinforcement learning algorithms in which memory is used to keep past experiences to increase learning speed, PQ-routing keeps the best experiences learned and reuses them by predicting the traffic trend. The effectiveness of PQ-routing has been verified under various network topologies and traffic conditions. Simulation results show that PQ-routing is superior to Q-routing in terms of both learning speed and adaptability.\n\n1 INTRODUCTION\n\nThe adaptive traffic control problem is to devise routing policies for controllers (i.e. routers) operating in a non-stationary environment to minimize the average packet delivery time. The controllers usually have no or only very little prior knowledge of the environment. 
While only local communication between controllers is allowed, the controllers must cooperate among themselves to achieve the common, global objective. Finding the optimal routing policy in such a distributed manner is very difficult. Moreover, since the environment is non-stationary, the optimal policy varies with time as a result of changes in network traffic and topology.\n\nIn (Boyan & Littman, 1994), a distributed adaptive traffic control scheme based on reinforcement learning (RL), called Q-routing, is proposed for the routing of packets in networks with dynamically changing traffic and topology. Q-routing is a variant of Q-learning (Watkins, 1989), which is an incremental (or asynchronous) version of dynamic programming for solving multistage decision problems. Unlike the original Q-learning algorithm, Q-routing is distributed in the sense that each communication node has a separate local controller, which does not rely on global information of the network for decision making and refinement of its routing policy.\n\n2 EXPLORATION VERSUS EXPLOITATION\n\nAs in other RL algorithms, one important issue Q-routing must deal with is the tradeoff between exploration and exploitation. While exploration of the state space is essential to learning good routing policies, continual exploration without putting the learned knowledge into practice is of no use. Moreover, exploration is not free. This dilemma is well known in the RL community and has been studied by some researchers, e.g. (Thrun, 1992).\n\nOne possibility is to divide learning into an exploration phase and an exploitation phase. The simplest exploration strategy is random exploration, in which actions are selected randomly without taking the reinforcement feedback into consideration. 
After the exploration phase, the optimal routing policy is simply to choose the next network node with minimum Q-value (i.e. minimum estimated delivery time). In so doing, Q-routing is expected to learn to avoid congestion along popular paths.\n\nAlthough Q-routing is able to alleviate congestion along popular paths by routing some traffic over other (possibly longer) paths, two problems are reported in (Boyan & Littman, 1994). First, Q-routing is not always able to find the shortest paths under low network load. For example, if there exists a longer path which has a Q-value less than the (erroneous) estimate of the shortest path, a routing policy that acts as a minimum selector will not explore the shortest path and hence will not update its erroneous Q-value. Second, Q-routing suffers from the so-called hysteresis problem, in that it fails to adapt to the optimal (shortest) path again when the network load is lowered. Once a longer path is selected due to an increase in network load, a minimum selector is no longer able to notice the subsequent decrease in traffic along the shortest path. Q-routing continues to choose the same (longer) path unless it also becomes congested and has a Q-value greater than some other path. Unless Q-routing continues to explore, the shortest path cannot be chosen again even though the network load has returned to a very low level. However, as mentioned in (Boyan & Littman, 1994), random exploration may have very negative effects on congestion, since packets sent along a suboptimal path tend to increase queue delays, slowing down all the packets passing through this path.
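The hysteresis problem described above can be reproduced with a toy two-path example. The following Python sketch is only an illustration under simplifying assumptions: the names greedy_route and simulate, the path labels and all numbers are hypothetical, and the Q-value update is reduced to replacing the estimate of the chosen path with a fixed observed delay (learning rate 1), rather than using a neighbor's estimate as in Q-routing.

```python
def greedy_route(q_values):
    # Minimum selector: pick the path with the smallest current Q-value.
    return min(q_values, key=q_values.get)


def simulate(steps=50):
    # Estimates left over from a congested period (illustrative values).
    q = {'short': 12.0, 'long': 6.0}
    # Assumed true delivery times after the load has dropped again.
    true_delay = {'short': 3.0, 'long': 6.0}
    choices = []
    for _ in range(steps):
        y = greedy_route(q)
        # Only the estimate of the path actually used is refreshed.
        q[y] = true_delay[y]
        choices.append(y)
    return q, choices


final_q, choices = simulate()
print(sorted(set(choices)), final_q)
```

Because the stale estimate of the short path is never refreshed, the selector keeps using the long path in every step, which is the behavior that probing is meant to correct.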
\n\nInstead of having two separate phases for exploration and exploitation, one alternative is to mix them together, with the emphasis shifting gradually from the former to the latter as learning proceeds. This can be achieved by a probabilistic scheme for choosing next nodes. For example, the Q-values may be related to probabilities by the Boltzmann-Gibbs distribution, involving a randomness (or pseudo-temperature) parameter T. To guarantee sufficient initial exploration and subsequent convergence, T usually has a large initial value (giving a uniform probability distribution) and decreases towards 0 (degenerating to a deterministic minimum selector) during the learning process. However, for a continuously operating network with dynamically changing traffic and topology, learning must be continual and hence cannot be controlled by a prespecified decay profile for T. An algorithm which automatically adapts between exploration and exploitation is therefore necessary. It is this very reason which led us to develop the algorithm presented in this paper.\n\n3 PREDICTIVE Q-ROUTING\n\nA memory-based Q-learning algorithm called predictive Q-routing (PQ-routing) is proposed here for adaptive traffic control. Unlike Dyna (Peng & Williams, 1993) and prioritized sweeping (Moore & Atkeson, 1993) in which memory is used to keep past experiences to increase learning speed, PQ-routing keeps the best experiences (best Q-values) learned and reuses them by predicting the traffic trend. The idea is as follows. Under low network load, the optimal policy is simply the shortest path routing policy. However, when the load level increases, packets tend to queue up along the shortest paths and the simple shortest path routing policy no longer performs well. 
If the congested paths are not used for a period of time, they will recover and become good candidates again. One should therefore try to utilize these paths by occasionally sending packets along them. We refer to such controlled exploration activities as probing. The probing frequency is crucial, as frequent probes will increase the load level along the already congested paths while infrequent probes will make the performance little different from Q-routing. Intuitively, the probing frequency should depend on the congestion level and the processing speed (recovery rate) of a path. The congestion level can be reflected by the current Q-value, but the recovery rate has to be estimated as part of the learning process.\n\nAt first glance, it seems that the recovery rate can be computed simply by dividing the difference in Q-values from two probes by the elapsed time. However, the recovery rate changes over time and depends on the current network traffic and the possibility of link/node failure. In addition, the elapsed time does not truly reflect the actual processing time a path needs. Thus this noisy recovery rate should be adjusted for every packet sent. It is important to note that the recovery rate in the algorithm should not be positive, otherwise it may increase the predicted Q-value without bound and hence the path can never be used again.
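The role of a non-positive recovery rate can be illustrated with the prediction rule used in the routing policy of the algorithm listing, in which the predicted Q-value is the larger of Q plus elapsed time times R and the best value B. The Python sketch below is a hedged illustration only: the function names predicted_q and first_probe_time, the integer time grid and all numeric values are assumptions made for this example and do not come from the experiments.

```python
def predicted_q(q, best, rate, elapsed):
    # Predicted Q-value after `elapsed` time units without an update.
    # `rate` is the recovery rate (non-positive); `best` is the lower bound B.
    return max(q + elapsed * rate, best)


def first_probe_time(q, best, rate, q_alternative, horizon=1000):
    # Smallest integer elapsed time at which the congested path is predicted
    # to beat the alternative path, or None if this never happens in `horizon`.
    for elapsed in range(horizon + 1):
        if predicted_q(q, best, rate, elapsed) < q_alternative:
            return elapsed
    return None


# Congested path: Q = 20, best value ever seen B = 3; alternative path: Q = 8.
slow_recovery = first_probe_time(20.0, 3.0, -0.5, 8.0)
fast_recovery = first_probe_time(20.0, 3.0, -2.0, 8.0)
no_recovery = first_probe_time(20.0, 3.0, 0.0, 8.0)
print(slow_recovery, fast_recovery, no_recovery)
```

Under these assumed numbers, a path that recovers faster is probed sooner, a recovery rate of zero gives no probing at all (which is plain Q-routing behavior), and the bound B keeps the prediction from dropping below the best delivery time ever learned.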
\n\nPredictive Q-Routing Algorithm\n\nTABLES:\n\n  Q_x(d, y) - estimated delivery time from node x to node d via neighboring node y\n  B_x(d, y) - best estimated delivery time from node x to node d via neighboring node y\n  R_x(d, y) - recovery rate for path from node x to node d via neighboring node y\n  U_x(d, y) - last update time for path from node x to node d via neighboring node y\n\nTABLE UPDATES: (after a packet arrives at node y from node x)\n\n  ΔQ = (transmission delay + queueing time at y + min_z{Q_y(d, z)}) - Q_x(d, y)\n  Q_x(d, y) ← Q_x(d, y) + α ΔQ\n  B_x(d, y) ← min(B_x(d, y), Q_x(d, y))\n  if (ΔQ < 0) then\n    ΔR ← ΔQ / (current time - U_x(d, y))\n    R_x(d, y) ← R_x(d, y) + β ΔR\n  else if (ΔQ > 0) then\n    R_x(d, y) ← γ R_x(d, y)\n  end if\n  U_x(d, y) ← current time\n\nROUTING POLICY: (packet is sent from node x to node y)\n\n  Δt = current time - U_x(d, y)\n  Q'_x(d, y) = max(Q_x(d, y) + Δt R_x(d, y), B_x(d, y))\n  y ← argmin_y{Q'_x(d, y)}\n\nThere are three learning parameters in the PQ-routing algorithm. α is the Q-function learning parameter as in the original Q-learning algorithm. In PQ-routing, this parameter should be set to 1 or else the accuracy of the recovery rate may be affected. β is used for learning the recovery rate. In our experiments, the value of 0.7 is used. γ is used for controlling the decay of the recovery rate, which affects the probing frequency in a congested path. Its value is usually chosen to be larger than β. In our experiments, the value of 0.9 is used.\n\nPQ-learning is identical to Q-learning in the way the Q-function is updated. The major difference is in the routing policy. 
Instead of selecting actions based solely on the current Q-values, the recovery rates are used to yield better estimates of the Q-values before the minimum selector is applied. This is desirable because the Q-values on which routing decisions are based may become outdated due to the ever-changing traffic.\n\n4 EMPIRICAL RESULTS\n\n4.1 A 15-NODE NETWORK\n\nTo demonstrate the effectiveness of PQ-routing, let us first consider a simple 15-node network (Figure 1(a)) with three sources (nodes 12 to 14) and one destination (node 15). Each node can process one packet per time step, except nodes 7 to 11 which are two times faster than the other nodes. Each link is bidirectional and has a transmission delay of one time unit. It is not difficult to see that the shortest paths are 12 → 1 → 4 → 15 for node 12, 13 → 2 → 4 → 15 for node 13, and 14 → 3 → 4 → 15 for node 14. However, since each node along these paths can process only one packet per time step, congestion will soon occur in node 4 if all source nodes send packets along the shortest paths.\n\nOne solution to this problem is that the source nodes send packets along different paths which share no common nodes. For instance, node 12 can send packets along path 12 → 1 → 5 → 6 → 15 while node 13 along 13 → 2 → 7 → 8 → 9 → 10 → 11 → 15 and node 14 along 14 → 3 → 4 → 15. The optimal routing policy depends on the traffic from each source node. If the network load is not too high, the optimal routing policy is to alternate between the upper and middle paths in sending packets.
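Before turning to the simulations, the tables, table updates and routing policy of Section 3 can be sketched for a single node in Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the class name PQNode, its method names, the initial Q-value of 0, the choice to start B at the first learned estimate, the default last update time of 0 and the guard against a zero elapsed time are all hypothetical choices made for illustration. The defaults for alpha, beta and gamma follow the values 1, 0.7 and 0.9 given in Section 3.

```python
class PQNode:
    # Tables of one node x, keyed by (destination d, neighbor y).
    # PQNode and its method names are hypothetical, for illustration only.

    def __init__(self, neighbors, alpha=1.0, beta=0.7, gamma=0.9, q_init=0.0):
        self.neighbors = list(neighbors)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma
        self.q_init = q_init  # assumed initial Q-value
        self.Q = {}  # estimated delivery time
        self.B = {}  # best (smallest) estimate seen so far
        self.R = {}  # recovery rate, kept non-positive
        self.U = {}  # time of last update

    def update(self, d, y, link_delay, queue_time, min_q_at_y, now):
        # Table updates after a packet for destination d arrives at neighbor y.
        key = (d, y)
        q_old = self.Q.get(key, self.q_init)
        delta_q = (link_delay + queue_time + min_q_at_y) - q_old
        q_new = q_old + self.alpha * delta_q
        self.Q[key] = q_new
        # Assumption: B starts at the first learned estimate for this pair.
        self.B[key] = min(self.B.get(key, q_new), q_new)
        elapsed = now - self.U.get(key, 0.0)
        rate = self.R.get(key, 0.0)
        if delta_q < 0 and elapsed > 0:  # elapsed > 0 guard is an assumption
            rate += self.beta * (delta_q / elapsed)
        elif delta_q > 0:
            rate *= self.gamma
        self.R[key] = rate
        self.U[key] = now

    def predicted(self, d, y, now):
        # Predicted Q-value: extrapolate with the recovery rate, bounded by B.
        key = (d, y)
        q = self.Q.get(key, self.q_init)
        dt = now - self.U.get(key, 0.0)
        return max(q + dt * self.R.get(key, 0.0), self.B.get(key, q))

    def choose(self, d, now):
        # Routing policy: minimum selector over the predicted Q-values.
        return min(self.neighbors, key=lambda y: self.predicted(d, y, now))


# Illustrative trace with made-up delays for destination 'D'.
node = PQNode(['a', 'b'])
node.update('D', 'a', 1, 0, 2, now=1)  # uncongested: estimate 3
node.update('D', 'a', 1, 8, 2, now=2)  # congestion: estimate 11
node.update('D', 'a', 1, 4, 2, now=4)  # partial recovery: estimate 7
node.update('D', 'b', 1, 0, 4, now=4)  # alternative path: estimate 5
print(node.choose('D', now=4), node.choose('D', now=6))
```

In this made-up trace, neighbor b is preferred immediately after the updates, while two time units later the negative recovery rate learned for neighbor a lowers its predicted Q-value below that of b, so the node probes a again.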
\n\n4.1.1 PERIODIC TRAFFIC PATTERNS UNDER LOW LOAD\n\nFor the convenience of empirical analysis, we first consider periodic traffic in which each source node generates the same traffic pattern over a period of time. Figure 1(b) shows the average delivery time for Q-routing and PQ-routing. PQ-routing performs better than Q-routing after the initial exploration phase (25 time steps), despite some slight oscillations. Such oscillations are due to the occasional probing activities of the algorithm. When we examine Q-routing more closely, we find that after the initial learning, all the source nodes try to send packets along the upper (shortest) path, leading to congestion in node 4. When this occurs, both nodes 12 and 13 switch to the middle path, which subsequently leads to congestion in node 5. Later, nodes 12 and 13 detect this congestion and then switch to the lower path. Since the nodes along this path have higher (two times) processing speed, the Q-values become stable and Q-routing will stay there as long as the load level does not increase. Thus, Q-routing fails to fine-tune the routing policy to improve it. PQ-routing, on the other hand, is able to learn the recovery rates and alternate between the upper and middle paths.\n\n(a) Network\n\n(b) Periodic traffic patterns under low load
\n\n(c) Aperiodic traffic patterns under high load\n\n(d) Varying traffic patterns and network load\n\nFigure 1: A 15-Node Network and Simulation Results\n\n4.1.2 APERIODIC TRAFFIC PATTERNS UNDER HIGH LOAD\n\nIt is not realistic to assume that network traffic is strictly periodic. In reality, the time interval between two packets sent by a node varies. To simulate varying intervals between packets, a probability of 0.8 is imposed on each source node for generating packets. In this case, the average delivery time for both algorithms oscillates. Figure 1(c) shows the performance of Q-routing and PQ-routing under high network load. The difference in delivery time between Q-routing and PQ-routing becomes less significant, as there is less available bandwidth in the shortest path for interleaving. Nevertheless, it can be seen that the overall performance of PQ-routing is still better than that of Q-routing.\n\n4.1.3 VARYING TRAFFIC PATTERNS AND NETWORK LOAD\n\nIn the more complicated situation of varying traffic patterns and network load, PQ-routing also performs better than Q-routing. Figure 1(d) shows the hysteresis problem in Q-routing under gradually changing traffic patterns and network load. After an initial exploration phase of 25 time steps, the load level is set to medium from time step 26 to 200. From step 201 to 500, node 14 ceases to send packets and nodes 12 and 13 slightly increase their load level. 
In this case, although the shortest path becomes available again, Q-routing is not able to notice the change in traffic and still uses the same routing policy, but PQ-routing is able to utilize the optimal paths. After step 500, node 13 also ceases to send packets. PQ-routing is successful in adapting to the optimal path 12 → 1 → 4 → 15.\n\n4.2 A 6x6 GRID NETWORK\n\nExperiments have been performed on some larger networks, including a 32-node hypercube and some random networks, with results similar to those above. Figures 2(b) and 2(c) depict results for Boyan and Littman's 6x6 grid network (Figure 2(a)) under varying traffic patterns and network load.\n\n(a) Network\n\n(b) Varying traffic patterns and network load\n\n(c) Varying traffic patterns and network load\n\nFigure 2: A 6x6 Grid Network and Simulation Results\n\nIn Figure 2(b), after an initial exploration for 50 time steps, the load level is set to low. From step 51 to 300, the load level increases to medium but with the same periodic traffic patterns. PQ-routing performs slightly better. From step 301 to 1000, the traffic patterns change dramatically under high network load. Q-routing cannot learn a stable policy in this (short) period of time, but PQ-routing becomes more stable after about 200 steps. From step 1000 onwards, the traffic patterns change again and the load level returns to low. PQ-routing still performs better.
\n\nIn Figure 2(c), the first 100 time steps are for initial exploration. After this period, packets are sent from the bottom right part of the grid to the bottom left part with low network load. PQ-routing is found to be as good as the shortest path routing policy, while Q-routing is slightly poorer than PQ-routing. From step 400 to 1000, packets are sent from both the left and right parts of the grid to the opposite sides at high load level. Both bottleneck paths become congested and hence the average delivery time increases for both algorithms. From time step 1000 onwards, the network load decreases to a more manageable level. We can see that PQ-routing is faster than Q-routing in adapting to this change.\n\n5 DISCUSSIONS\n\nPQ-learning is generally better than Q-learning under both low and varying network load conditions. Under high load conditions, they give comparable performance. In general, Q-routing prefers stable routing policies and tends to send packets along paths with higher processing power, regardless of the actual packet delivery time. This strategy is good under extremely high load conditions, but may not be optimal under other situations. PQ-routing, on the contrary, is more aggressive. It tries to minimize the average delivery time by occasionally probing the shortest paths. If the load level remains extremely high with the patterns unchanged, PQ-routing will gradually degenerate to Q-routing, until the traffic changes again. Another advantage PQ-routing has over Q-routing is that shorter adaptation time is generally needed when the traffic patterns change, since the routing policy of PQ-routing depends not only on the current Q-values but also on the recovery rates. 
In terms of memory requirement, PQ-routing needs more memory for recovery rate estimation. It should be noted, however, that extra memory is needed only for the visited states. In the worst case, it is still in the same order as that of the original Q-routing algorithm. In terms of computational cost, recovery rate estimation is computationally quite simple. Thus the overhead for implementing PQ-routing should be minimal.\n\nReferences\n\nJ.A. Boyan & M.L. Littman (1994). Packet routing in dynamically changing networks: a reinforcement learning approach. Advances in Neural Information Processing Systems 6, 671-678. Morgan Kaufmann, San Mateo, California.\n\nM. Littman & J. Boyan (1993). A distributed reinforcement learning scheme for network routing. Proceedings of the First International Workshop on Applications of Neural Networks to Telecommunications, 45-51. Lawrence Erlbaum, Hillsdale, New Jersey.\n\nA.W. Moore & C.G. Atkeson (1993). Memory-based reinforcement learning: efficient computation with prioritized sweeping. Advances in Neural Information Processing Systems 5, 263-270. Morgan Kaufmann, San Mateo, California.\n\nA.W. Moore & C.G. Atkeson (1993). Prioritized sweeping: reinforcement learning with less data and less time. Machine Learning, 13:103-130.\n\nJ. Peng & R.J. Williams (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1:437-454.\n\nS. Thrun (1992). The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D.A. White & D.A. Sofge (eds). Van Nostrand Reinhold, New York.\n\nC.J.C.H. Watkins (1989). Learning from delayed rewards. PhD Thesis, University of Cambridge, England.
", "award": [], "sourceid": 1096, "authors": [{"given_name": "Samuel", "family_name": "Choi", "institution": null}, {"given_name": "Dit-Yan", "family_name": "Yeung", "institution": null}]}