{"title": "Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 671, "page_last": 678, "abstract": null, "full_text": "Packet  Routing  in  Dynamically \n\nChanging Networks: \n\nA  Reinforcement  Learning  Approach \n\nJustin A.  Boyan \n\nSchool of Computer Science \nCarnegie  Mellon  University \n\nPittsburgh,  PA  15213 \n\nMichael L.  Littman\u00b7 \n\nCognitive  Science  Research  Group \n\nBellcore \n\nMorristown,  NJ  07962 \n\nAbstract \n\nThis  paper  describes  the  Q-routing  algorithm for  packet  routing, \nin  which  a  reinforcement  learning  module  is  embedded  into  each \nnode  of a  switching  network.  Only  local  communication  is  used \nby each node to keep  accurate statistics on which routing decisions \nlead  to  minimal  delivery  times.  In  simple  experiments  involving \na  36-node,  irregularly  connected  network,  Q-routing  proves  supe(cid:173)\nrior  to  a  nonadaptive  algorithm  based  on  precomputed  shortest \npaths  and  is  able  to route  efficiently  even  when  critical  aspects  of \nthe  simulation,  such  as  the  network  load,  are  allowed  to  vary  dy(cid:173)\nnamically.  The  paper  concludes  with  a  discussion  of the  tradeoff \nbetween  discovering  shortcuts  and  maintaining stable policies. \n\n1 \n\nINTRODUCTION \n\nThe  field  of reinforcement  learning  has  grown  dramatically  over  the  past  several \nyears,  but  with  the  exception  of backgammon [8,  2],  has  had few  successful  appli(cid:173)\ncations  to large-scale,  practical  tasks.  This  paper  demonstrates that  the  practical \ntask  of routing  packets  through  a  communication network  is  a  natural application \nfor  reinforcement  learning  algorithms. \n\n*Now  at  Brown  University,  Department of Computer Science \n\n671 \n\n\f672 \n\nBoyan and Littman \n\nOur \"Q-routing\"  algorithm, related to certain distributed packet routing algorithms \n[6,  7],  learns  a  routing  policy  which  balances  minimizing  the  number  of  \"hops\"  a \npacket  will  take  with  the  possibility  of congestion  along  popular  routes.  It does \nthis  by  experimenting with  different  routing policies  and gathering statistics about \nwhich  decisions  minimize total  delivery  time.  The learning is  continual and online, \nuses  only  local  information, and  is  robust  in  the face  of irregular  and  dynamically \nchanging network  connection  patterns  and load. \n\nThe  experiments  in  this  paper were  carried  out  using  a  discrete  event  simulator to \nmodel  the  transmission of packets  through  a  local  area network  and  are  described \nin  detail in  [5]. \n\n2  ROUTING  AS  A  REINFORCEMENT  LEARNING \n\nTASK \n\nA  packet  routing  policy  answers  the  question:  to  which  adjacent  node  should  the \ncurrent  node  send  its  packet  to get  it  as  quickly  as  possible  to its eventual destina(cid:173)\ntion?  Since  the policy's performance is  measured by the total time taken to deliver \na packet, there is  no  \"training signal\" for  directly evaluating or improving the policy \nuntil a packet finally reaches its destination.  However,  using reinforcement learning, \nthe policy can be  updated more quickly  and using  only local information. 

Let Q_x(d, y) be the time that a node x estimates it takes to deliver a packet P bound for node d by way of x's neighbor node y, including any time that P would have to spend in node x's queue.^1 Upon sending P to y, x immediately gets back y's estimate for the time remaining in the trip, namely

    t = \min_{z \in \mathrm{neighbors}(y)} Q_y(d, z).

If the packet spent q units of time in x's queue and s units of time in transmission between nodes x and y, then x can revise its estimate as follows:

    \Delta Q_x(d, y) = \eta \left( \underbrace{q + s + t}_{\text{new estimate}} - \underbrace{Q_x(d, y)}_{\text{old estimate}} \right),

where \eta is a "learning rate" parameter (usually 0.5 in our experiments). The resulting algorithm can be characterized as a version of the Bellman-Ford shortest paths algorithm [1, 3] that (1) performs its path relaxation steps asynchronously and online, and (2) measures path length not merely by number of hops but rather by total delivery time.

We call our algorithm "Q-routing" and represent the Q-function Q_x(d, y) by a large table. We also tried approximating Q_x with a neural network (as in e.g. [8, 4]), which allowed the learner to incorporate diverse parameters of the system, including local queue size and time of day, into its distance estimates. However, the results of these experiments were inconclusive.

^1 We denote the function by Q because it corresponds to the Q function used in the reinforcement learning technique of Q-learning [10].

Figure 1: The irregular 6 x 6 grid topology

3 RESULTS

We tested the Q-routing algorithm on a variety of network topologies, including the 7-hypercube, a 116-node LATA telephone network, and an irregular 6 x 6 grid. Varying the network load, we measured the average delivery time for packets in the system after learning had settled on a routing policy, and compared these delivery times with those given by a static routing scheme based on shortest paths. The result was that in all cases, Q-routing is able to sustain a higher level of network load than shortest paths can.

This section presents detailed results for the irregular grid network pictured in Figure 1. Under conditions of low load, the network learns fairly quickly to route packets along shortest paths to their destinations. The performance vs. time curve plotted in the left part of Figure 2 demonstrates that the Q-routing algorithm, after an initial period of inefficiency during which it learns the network topology, performs about as well as the shortest path router, which is optimal under low load.

As network load increases, however, the shortest path routing scheme ceases to be optimal: it ignores the rising levels of congestion and soon floods the network with packets. The right part of Figure 2 plots performance vs. time for the two routing schemes under high load conditions: while shortest path routing is unable to tolerate the packet load, Q-routing learns an efficient routing policy.
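
The mechanics behind this behavior are exactly the update rule of Section 2, applied independently at every node. The following Python sketch is purely illustrative: the Node class, its attribute names, and the zero initialization of the table are our assumptions, not details of the simulator described in [5].

class Node:
    def __init__(self, node_id, neighbors, num_nodes, learning_rate=0.5):
        self.node_id = node_id
        self.neighbors = neighbors            # adjacent Node objects
        self.eta = learning_rate              # the paper typically uses 0.5
        # q[d][y] = estimated time to deliver a packet bound for d via neighbor y.
        # Initializing the estimates to zero is an assumption, not from the paper.
        self.q = {d: {y.node_id: 0.0 for y in neighbors} for d in range(num_nodes)}

    def best_neighbor(self, dest):
        # Greedy routing decision: forward to the neighbor with the smallest estimate.
        return min(self.neighbors, key=lambda y: self.q[dest][y.node_id])

    def remaining_estimate(self, dest):
        # The scalar t reported back to an upstream node: min over z of Q_y(dest, z).
        return min(self.q[dest].values())

    def update(self, dest, via, q_time, s_time, t):
        # Revise Q_x(dest, via) after sending a packet for dest to neighbor `via`:
        #   q_time = time the packet spent in this node's queue
        #   s_time = transmission time on the link to `via`
        #   t      = `via`'s estimate of the remaining trip time
        old = self.q[dest][via.node_id]
        self.q[dest][via.node_id] = old + self.eta * (q_time + s_time + t - old)

On each hop, the sending node picks best_neighbor, measures the queueing time q and transmission time s for that packet, receives the neighbor's remaining_estimate as t, and calls update; this single scalar is the only information that ever crosses the link.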

The reason for the learning algorithm's success is apparent in the "policy summary diagrams" in Figure 3. These diagrams indicate, for each node under a given policy, how many of the 36 x 35 point-to-point routes go through that node. In the left part of Figure 3, which summarizes the shortest path routing policy, two nodes in the center of the network (labeled 570 and 573) are on many shortest paths and thus become congested when network load is high. By contrast, the diagram on the right shows that Q-routing, under conditions of high load, has learned a policy which routes some traffic over a longer than necessary path (across the top of the network) so as to avoid congestion in the center of the network.

Figure 2: Performance under low load and high load (average delivery time vs. simulator time, for Q-routing and shortest paths)

Figure 3: Policy summaries: shortest path and Q-routing under high load
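
The per-node counts plotted in these policy summaries can be recomputed for any next-hop policy by tracing each of the 36 x 35 point-to-point routes and tallying the nodes visited. The sketch below is illustrative only: the next_hop(x, d) interface, the loop guard, and the convention that a route counts every node it touches (including its endpoints) are our assumptions.

def policy_summary(nodes, next_hop, max_hops=100):
    # nodes: iterable of node identifiers.
    # next_hop(x, d): the policy's choice of neighbor when node x holds a packet for d.
    counts = {n: 0 for n in nodes}
    for src in nodes:
        for dst in nodes:
            if src == dst:
                continue
            route = [src]
            while route[-1] != dst and len(route) < max_hops:   # guard against routing loops
                route.append(next_hop(route[-1], dst))
            for n in set(route):      # a route counts each node it visits once
                counts[n] += 1
    return counts

Applied to the shortest path policy, such counts concentrate on the two central nodes, which is the congestion visible in the left diagram of Figure 3.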

The basic result is captured in Figure 4, which compares the performances of the shortest path policy and the Q-routing learned policy at various levels of network load. Each point represents the median (over 19 trials) of the mean packet delivery time after learning has settled. When the load is very low, the Q-routing algorithm routes nearly as efficiently as the shortest path policy. As load increases, the shortest path policy leads to exploding levels of network congestion, whereas the learning algorithm continues to route efficiently. Only after a further significant increase in load does the Q-routing algorithm, too, succumb to congestion.

Figure 4: Delivery time at various loads for Q-routing and shortest paths (average delivery time vs. network load level)

3.1 DYNAMICALLY CHANGING NETWORKS

One advantage a learning algorithm has over a static routing policy is the potential for adapting to changes in crucial system parameters during network operation. We tested the Q-routing algorithm, unmodified, on networks whose topology, traffic patterns, and load level were changing dynamically:

Topology  We manually disconnected links from the network during simulation. Qualitatively, Q-routing reacted quickly to such changes and was able to continue routing traffic efficiently.

Traffic patterns  We caused the simulation to oscillate periodically between two very different request patterns in the irregular grid: one in which all traffic was directed between the upper and lower halves of the network, and one in which all traffic was directed between the left and right halves. Again, after only a brief period of inefficient routing each time the request pattern switched, the Q-routing algorithm adapted successfully.

Load level  When the overall level of network traffic was raised during simulation, Q-routing quickly adapted its policy to route packets around new bottlenecks. However, when network traffic levels were then lowered again, adaptation was much slower and never converged on the optimal shortest paths. This effect is discussed in the next section.

3.2 EXPLORATION

Given the similarity between the Q-routing update equation and the Bellman-Ford recurrence for shortest paths, it seems surprising that there is any difference whatsoever between the performance of Q-routing and shortest paths routing at low load, as is visible in Figure 4. However, a close look at the algorithm reveals that Q-routing cannot fine-tune a policy to discover shortcuts, since only the best neighbor's estimate is ever updated. For instance, if a node learns an overestimate of the delivery time for an optimal route, then it will select a suboptimal route as long as that route's delivery time is less than the erroneous estimate of the optimal route's delivery time.

This drawback of greedy Q-learning is widely recognized in the reinforcement learning community, and several exploration techniques have been suggested to overcome it [9]. A common one is to have the algorithm select actions with some amount of randomness during the initial learning period [10].
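
In terms of the Node sketch given earlier, such a randomized selection rule might look like the following; the value of epsilon and any decay schedule are assumptions made purely for illustration and were not used in our experiments.

import random

def epsilon_greedy_neighbor(node, dest, epsilon=0.1):
    # With probability epsilon, explore by forwarding to a random neighbor;
    # otherwise exploit the current Q-table with the usual greedy choice.
    if random.random() < epsilon:
        return random.choice(node.neighbors)
    return node.best_neighbor(dest)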

But this approach has two serious drawbacks in the context of distributed routing: (1) the network is continuously changing, thus the initial period of exploration never ends; and, more significantly, (2) random traffic has an extremely negative effect on congestion. Packets sent in a suboptimal direction tend to add to queue delays, slowing down all the packets passing through those queues, which adds further to queue delays, and so on. Because the nodes make their policy decisions based on only local information, this increased congestion actually changes the problem the learners are trying to solve.

Instead of sending actual packets in a random direction, a node using the "full echo" modification of Q-routing sends requests for information to its immediate neighbors every time it needs to make a decision. Each neighbor returns a single number (using a separate channel so as not to contribute to network congestion in our model) giving that node's current estimate of the total time to the destination. These estimates are used to adjust the Q_x(d, y) values for each neighbor y. When shortcuts appear, or if there are inefficiencies in the policy, this information propagates very quickly through the network and the policy adjusts accordingly.

Figure 5 compares the performance of Q-routing and shortest paths routing with "full echo" Q-routing. At low loads the performance of "full echo" Q-routing is indistinguishable from that of the shortest path policy, as all inefficiencies are purged. Under high load conditions, "full echo" Q-routing outperforms shortest paths, but the basic Q-routing algorithm does better still. Our analysis indicates that "full echo" Q-routing constantly changes policy under high load, oscillating between using the upper bottleneck and using the central bottleneck for the majority of cross-network traffic. This behavior is unstable and generally leads to worse routing times under high load.

Figure 5: Delivery time at various loads for Q-routing, shortest paths, and "full echo" Q-routing (average delivery time vs. network load level)

Ironically, the "drawback" of the basic Q-routing algorithm (that it does no exploration and no fine-tuning after initially learning a viable policy) actually leads to improved performance under high load conditions. We still know of no single algorithm which performs best under all load conditions.
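
For completeness, the "full echo" decision procedure can be rendered in the same style as the earlier sketches. The text above specifies only that each neighbor returns a single estimate over a separate channel and that the Q_x(d, y) entries are then adjusted; the precise arithmetic below (reusing the learning rate and adding an assumed per-link transmission term) is our guess at one reasonable realization, not a specification of the algorithm we simulated.

def full_echo_decision(node, dest, link_time):
    # Before every routing decision, query each neighbor (out of band, so the
    # queries add no data traffic in the model) for its remaining-time estimate,
    # refresh the corresponding Q entries, and then route greedily as before.
    # link_time(y): assumed estimate of the transmission time to neighbor y.
    for y in node.neighbors:
        t = y.remaining_estimate(dest)                 # neighbor's min over z of Q_y(dest, z)
        old = node.q[dest][y.node_id]
        node.q[dest][y.node_id] = old + node.eta * (link_time(y) + t - old)
    return node.best_neighbor(dest)

Because every neighbor's entry is refreshed on every decision, shortcut information propagates quickly; as discussed above, the price is instability under high load.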
The  \"Q-routing\"  algorithm,  without  having  to  know  in  advance  the  network \ntopology and traffic patterns,  and without the need for  any centralized routing con(cid:173)\ntrol  system,  is  able  to  discover  efficient  routing  policies  in  a  dynamically changing \nnetwork.  Although  the  simulations  described  here  are  not  fully  realistic  from  the \nstandpoint of actual telecommunication networks,  we  believe  this paper has shown \nthat  adaptive  routing  is  a  natural  domain for  reinforcement  learning.  Algorithms \nbased on Q-routing but specifically tailored to the packet  routing domain will likely \nperform even  better. \n\nOne  of the most interesting  directions  for  future  work  is  to replace  the  table-based \nrepresentation of the routing policy with a function approximator.  This could allow \nthe  algorithm  to  integrate  more  system  variables  into  each  routing  decision  and \nto generalize over  network destinations.  Potentially, much  less  routing information \nwould  need  to  be  stored  at  each  node,  thereby  extending  the  scale  at  which  the \nalgorithm is useful.  We  plan to explore some of these issues in the context of packet \nrouting or  related  applications such  as  auto traffic  control and elevator control. \n\n\f678 \n\nBoyan and Littman \n\nAcknowledgements \n\nThe  authors  would  like  to thank  for  their  support  the  Bellcore  Cognitive  Science \nResearch Group, the National Defense Science and Engineering Graduate fellowship \nprogram, and  National Science  Foundation Grant IRI-9214873. \n\nReferences \n\n[1]  R.  Bellman.  On  a  routing  problem.  Quarterly  of  Applied  Mathematics, \n\n16(1):87-90,  1958. \n\n[2]  J. Boyan. Modular neural networks for learning context-dependent game strate(cid:173)\n\ngies.  Master's thesis,  Computer Speech  and Language  Processing,  Cambridge \nUniversity,  1992. \n\n[3]  L.  R.  Ford,  Jr.  Flows  in  Networks.  Princeton  University  Press,  1962. \n[4]  L.-J .  Lin.  Reinforcement  Learning  for  Robots  Using  Neural  Networks.  PhD \n\nthesis,  School of Computer Science,  Carnegie  Mellon  University,  1993. \n\n[5]  M.  Littman and J. Boyan.  A distributed reinforcement learning scheme for  net(cid:173)\nwork  routing.  Technical Report  CMU-CS-93-165, School of Computer Science, \nCarnegie Mellon  University,  1993. \n\n[6]  H.  Rudin.  On routing  and  delta routing:  A taxonomy  and  performance com(cid:173)\n\nparison  of  techniques  for  packet-switched  networks. \nCommunications,  COM-24(1):43-59, January  1976. \n\nIEEE  Transactions  on \n\n[7]  A.  Tanenbaum .  Computer  Networks.  Prentice-Hall,  second  edition  edition, \n\n1989. \n\n[8]  G.  Tesauro.  Practial issues  in temporal difference  learning.  Machine  Learning, \n\n8(3/4),  May  1992. \n\n[9]  Sebastian  B.  Thrun.  The role  of exploration in  learning control.  In  David A. \nWhite  and  Donald A.  Sofge,  editors,  Handbook  of Intelligent  Control:  Neural, \nFuzzy,  and  Adaptive  Approaches.  Van  Nostrand  Reinhold,  New  York,  1992. \n\n[10]  C .  Watkins.  Learning  from  Delayed  Rewards.  PhD  thesis,  King's  College, \n\nCambridge,  1989. \n\n\f", "award": [], "sourceid": 770, "authors": [{"given_name": "Justin", "family_name": "Boyan", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}