{"title": "Optimal Asset Allocation using Adaptive Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 952, "page_last": 958, "abstract": null, "full_text": "Optimal Asset Allocation using Adaptive Dynamic Programming\n\nRalph Neuneier*\n\nSiemens AG, Corporate Research and Development\n\nOtto-Hahn-Ring 6, D-81730 M\u00fcnchen, Germany\n\nAbstract\n\nIn recent years, the interest of investors has shifted to computerized asset allocation (portfolio management) to exploit the growing dynamics of the capital markets. In this paper, asset allocation is formalized as a Markovian Decision Problem which can be optimized by applying dynamic programming or reinforcement learning based algorithms. Using an artificial exchange rate, the asset allocation strategy optimized with reinforcement learning (Q-Learning) is shown to be equivalent to a policy computed by dynamic programming. The approach is then tested on the task of investing liquid capital in the German stock market. Here, neural networks are used as value function approximators. The resulting asset allocation strategy is superior to a heuristic benchmark policy. This is a further example which demonstrates the applicability of neural network based reinforcement learning to a problem setting with a high dimensional state space.\n\n1  Introduction\n\nBillions of dollars are pushed daily through the international capital markets while brokers shift their investments to more promising assets. Therefore, there is great interest in achieving a deeper understanding of the capital markets and in developing efficient tools for exploiting the dynamics of the markets.
\n\n* Ralph.Neuneier@zfe.siemens.de, http://www.siemens.de/zfe.Jlll/homepage.html\n\nAsset allocation (portfolio management) is the investment of liquid capital in various trading opportunities such as stocks, futures, foreign exchange and others. A portfolio is constructed with the aim of achieving a maximal expected return for a given risk level and time horizon. To compose an optimal portfolio, the investor has to solve a difficult optimization problem consisting of two phases (Brealy, 1991). First, the expected yields are estimated together with a certainty measure. Second, based on these estimates, a portfolio is constructed obeying the risk level the investor is willing to accept (mean-variance techniques). The problem is further complicated if transaction costs must be considered and if the investor wants to revise the decision at every time step. In recent years, neural networks (NN) have been successfully used for the first task. Typically, a NN delivers the expected future values of a time series based on data of the past. Furthermore, a confidence measure which expresses the certainty of the prediction is provided.\n\nIn the following, the modeling phase and the search for an optimal portfolio are combined and embedded in the framework of Markovian Decision Problems (MDP). That theory formalizes control problems within stochastic environments (Bertsekas, 1987; Elton, 1971). If the discrete state space is small and if an accurate model of the system is available, MDP can be solved by conventional Dynamic Programming (DP). At the other extreme, reinforcement learning methods, e.g. Q-Learning (QL), can be applied to problems with large state spaces and with no appropriate model available (Singh, 1994).
\n\n2  Portfolio Management is a Markovian Decision Problem\n\nThe following simplifications do not restrict the generality of the proposed methods with respect to real applications but will help to clarify the relationship between MDP and portfolio optimization.\n\n\u2022 There is only one possible asset for a Deutsch-Mark based investor, say a foreign currency called Dollar, US-$.\n\n\u2022 The investor is small and does not influence the market by her/his trading.\n\n\u2022 The investor has no risk aversion and always invests the total amount.\n\n\u2022 The investor may trade at each time step for an infinite time horizon.\n\nMDP provide a model for multi-stage decision making problems in stochastic environments. MDP can be described by a finite state set \\(S = \\{1, \\ldots, n\\}\\), a finite set \\(U(i)\\) of admissible control actions for every state \\(i \\in S\\), a set of transition probabilities \\(p_{ij}\\) which describe the dynamics of the system, and a return function\u00b9 \\(r(i, j, u(i))\\), with \\(i, j \\in S\\), \\(u(i) \\in U(i)\\). Furthermore, there is a stationary policy \\(\\pi(i)\\), which delivers for every state an admissible action \\(u(i)\\). One can compute the value function \\(V_i^{\\pi}\\) of a given state and policy,\n\n\\[ V_i^{\\pi} = E\\Big[\\sum_{t=0}^{\\infty} \\gamma^t R(i_t, \\pi(i_t))\\Big], \\qquad (1) \\]\n\nwhere \\(E\\) indicates the expected value, \\(\\gamma\\) is the discount factor with \\(0 \\le \\gamma < 1\\), and where \\(R\\) are the expected returns, \\(R = E_j[r(i, j, u(i))]\\). The aim is now to find a policy \\(\\pi^*\\) with the optimal value function \\(V_i^* = \\max_{\\pi} V_i^{\\pi}\\) for all states.\n\n\u00b9 In the MDP literature, the return often depends only on the current state \\(i\\), but the theory extends to the case of \\(r = r(i, j, u(i))\\) (see Singh, 1994).
\n\nIn the context discussed here, a state vector consists of elements which describe the financial time series, and of elements which quantify the current value of the investment. For the simple example above, the state vector is the triple of the exchange rate, \\(x_t\\), the wealth of the portfolio, \\(c_t\\), expressed in the basis currency (here DM), and a binary variable \\(b\\), indicating whether the investment is currently in DM or US-$.\n\nNote that, of the variables which form the state vector, the exchange rate is actually independent of the portfolio decisions, but the wealth and the returns are not. Therefore, asset allocation is a control problem and may not be reduced to pure prediction.\u00b2 This problem has the attractive feature that, because the investments do not influence the exchange rate, we do not need to invest real money during the training phase of QL until we are convinced that our strategy works.\n\n3  Dynamic Programming: Off-line and Adaptive\n\nThe optimal value function \\(V^*\\) is the unique solution of the well-known Bellman equation (Bertsekas, 1987). According to that equation, one has to maximize the expected return for the next step and follow an optimal policy thereafter in order to achieve globally optimal behavior (Bertsekas, 1987). An optimal policy can easily be derived from \\(V^*\\) by choosing a \\(\\pi(i)\\) which satisfies the Bellman equation. For nonlinear systems and non-quadratic cost functions, \\(V^*\\) is typically found by using an iterative algorithm, value iteration, which converges asymptotically to \\(V^*\\). Value iteration repeatedly applies the operator \\(T\\) for all states \\(i\\),\n\n\\[ T(V_i) = \\max_{u(i) \\in U(i)} \\Big[ R(i, u(i)) + \\gamma \\sum_{j \\in S} p_{ij} V_j \\Big]. \\qquad (2) \\]\n\nValue iteration assumes that the expected return function \\(R(i, u(i))\\) and the transition probabilities \\(p_{ij}\\) (i.e. the model) are known. 
Q-Learning (QL) is a reinforcement-learning method that does not require a model of the system but optimizes the policy by sampling state-action pairs and returns while interacting with the system (Barto, 1989). Let us assume that the investor executes action \\(u(i)\\) at state \\(i\\), and that the system moves to a new state \\(j\\). Let \\(r(i, j, u(i))\\) denote the actual return. QL then uses the update equation\n\n\\[ \\begin{aligned} Q(i, u(i)) &= (1 - \\eta)\\, Q(i, u(i)) + \\eta \\big( r(i, j, u(i)) + \\gamma \\max_{u(j)} Q(j, u(j)) \\big) \\\\ Q(k, v) &= Q(k, v), \\quad \\text{for all } k \\neq i \\text{ and } v \\neq u(i) \\end{aligned} \\qquad (3) \\]\n\nwhere \\(\\eta\\) is the learning rate and \\(Q(i, u(i))\\) are the tabulated Q-values. One can prove that this relaxation algorithm converges (under some conditions) to the optimal Q-values (Singh, 1994).\n\n\u00b2 To be more precise, the problem only becomes a multi-stage decision problem if the transaction costs are included in the problem.\n\nThe selection of the action \\(u(i)\\) should be guided by the trade-off between exploration and exploitation. In the beginning, the actions are typically chosen randomly (exploration), and in the course of training, actions with larger Q-values are chosen with increasingly higher probability (exploitation). The implementation in the following experiments is based on the Boltzmann distribution using the actual Q-values and a slowly decreasing temperature parameter (see Barto, 1989).\n\n4  Experiment I: Artificial Exchange Rate\n\nIn this section we use an exchange-rate model to demonstrate how DP and Q-Learning can be used to optimize asset allocation.\n\nThe artificial exchange rate \\(x_t\\) is in the range between 1 and 2, representing the value of 1 US-$ in DM. 
The transition probabilities \\(p_{ij}\\) of the exchange rate are chosen to simulate a situation where \\(x_t\\) follows an increasing trend, but with higher values of \\(x_t\\), a drop to very low values becomes more and more probable. A realization of the time series is plotted in the upper part of fig. 2. The random state variable \\(c_t\\) depends on the investor's decisions \\(u_t\\), and is further influenced by \\(x_t\\), \\(x_{t+1}\\), and \\(c_{t-1}\\). A complete state vector consists of the current exchange rate \\(x_t\\) and the capital \\(c_t\\), which is always calculated in the basis currency (DM). Its sign represents the actual currency, i.e., \\(c_t = -1.2\\) stands for an investment in US-$ worth 1.2 DM, and \\(c_t = 1.2\\) for a capital of 1.2 DM. \\(c_t\\) and \\(x_t\\) are discretized in 10 bins each. The transaction costs \\(\\xi = 0.1 + |c/100|\\) are a combination of fixed (0.1) and variable costs (\\(|c/100|\\)). Transaction costs apply only if the currency is changed from DM to US-$. The immediate return \\(r_t(x_t, c_t, x_{t+1}, u_t)\\) is computed as in table 1. If the decision has been made to change the portfolio into DM or to keep the actual portfolio in DM, \\(u_t\\) = DM, then the return is always zero. If the decision has been made to change the portfolio into US-$ or to keep the actual portfolio in US-$, \\(u_t\\) = US-$, then the return is equal to the relative change of the exchange rate weighted with \\(c_t\\). That return is reduced by the transaction costs \\(\\xi\\) if the investor has to change into US-$.\n\nTable 1: The immediate return function.\n\n               u_t = DM    u_t = US-$\nc_t in DM      0           [illegible]\nc_t in US-$    0           [illegible]\n\nThe success of the strategies was tested on a realization (2000 data points) of the exchange rate. The initial investment is 1 DM; at each time step the algorithm has to decide either to change the currency or to remain in the present currency.
\n\nAs a reinforcement learning method, QL has to interact with the environment to learn optimal behavior. Thus, a second set of 2000 data points was used to learn the Q-values. The training phase is divided into epochs. Each epoch consists of as many trials as there are data points in the training set. At every trial the algorithm looks at \\(x_t\\), randomly chooses a portfolio value \\(c_t\\) and selects a decision. Then the immediate return and the new state are evaluated to apply eq. 3. The Q-values were initialized with zero, and the learning rate \\(\\eta\\) was 0.1. Convergence was achieved after 4 epochs.\n\nFigure 1: The optimal decisions (left) and value function (right).\n\nFigure 2: The exchange rate (top), the capital and the decisions (bottom). Horizontal axis: time.\n\nTo evaluate the solution QL has found, the DP algorithm from eq. 2 was implemented using the given transition probabilities. The convergence of DP was very fast. Only 5 iterations were needed until the average difference between successive value functions was lower than 0.01. That means 500 updates (5 sweeps over the 10 x 10 = 100 discretized states) in comparison to 8000 updates with QL (4 epochs of 2000 trials).\n\nThe solutions were identical with respect to the resulting policy, which is plotted in fig. 1, left. It can clearly be seen that there is a difference between the policy of a DM-based and a US-$-based portfolio. 
If one has already changed the capital to US-$, then it is advisable to keep the portfolio in US-$ until the risk gets too high, i.e. \\(x_t \\in \\{1.8, 1.9\\}\\). On the other hand, if \\(c_t\\) is still in DM, the risk barrier moves to lower values depending on the volume of the portfolio. The reason is that the potential gain from an increasing exchange rate has to cover the fixed and variable transaction costs. For very low values of \\(c_t\\), the policy never changes the currency, even at low \\(x_t\\), because the fixed transaction costs would be higher than any gain. Figure 2 plots the exchange rate \\(x_t\\), the accumulated capital \\(c_t\\) for 100 days, and the decisions \\(u_t\\).\n\nLet us look at a few interesting decisions. At the beginning, \\(t = 0\\), the portfolio was changed immediately to US-$ and kept there for 13 steps until a drop to low rates \\(x_t\\) became very probable. During the time steps 35-45, the exchange rate oscillated at higher values. The policy insisted on the DM portfolio, because the risk was too high. In contrast, looking at the time steps 24 to 28, the policy first switched back to DM, then there was a small decrease of \\(x_t\\) which was sufficient to let the investor change again. The following increase justified that decision. The success of the resulting strategy can easily be recognized by the continuous increase of the portfolio. Note that the ups and downs of the portfolio curve get higher in magnitude at the end because the investor has no risk aversion and the whole capital is always traded.
\n\n5  Experiment II: German Stock Index DAX\n\nIn this section the approach is tested on a real-world task: assume that an investor wishes to invest her/his capital in a block of stocks which behaves like the German stock index DAX. We based the benchmark strategy (short: MLP) on a NN model which was built to predict the daily changes of the DAX (for details, see Dichtl, 1995). If the predicted next-day DAX difference is positive, then the capital is invested in DAX, otherwise in DM. The input vector of the NN model was carefully optimized for optimal prediction. We used these inputs (the DAX itself and 11 other influencing market variables) as the market description part of the state vector for QL. In order to store the value functions, two NNs, one for each action, with 8 nonlinear hidden neurons and one linear output are used.\n\nThe data is split into a training set (from 2 Jan. 1986 to 31 Dec. 1992) and a test set (from 2 Jan. 1993 to 18 Oct. 1995). The return function is defined in the same way as in section 4 using 0.4% as proportional costs and 0.001 units as fixed costs, which are realistic for financial institutions. The training proceeds as outlined in the previous section with \\(\\eta = 0.001\\) for 1000 epochs.\n\nIn fig. 3 the development of a reinvested capital is plotted for the optimized (upper line) and the MLP strategy (middle line). The DAX itself is also plotted but with a scaling factor to fit it into the figure (lower line). The policy resulting from QL clearly beats the benchmark strategy because the extra return amounts to 80% at the end of the training period and to 25% at the end of the test phase. A closer look at some statistics can explain the success. 
The QL policy proposes investing in DAX almost as often as the MLP policy, but the number of changes from DM to DAX and vice versa is much lower (see table 2). Furthermore, it seems that the QL strategy keeps the capital out of the market if there is no significant trend to follow and the market shows too much volatility (see fig. 3, where straight horizontal lines of the capital development curve indicate no investments). An extensive analysis of the resulting strategy will be the topic of future research.\n\nIn a further experiment the NNs which store the Q-values are initialized to imitate the MLP strategy. In some runs the number of necessary epochs was reduced by a factor of 10. But often the QL algorithm took longer to converge because the initialization ignores the input elements which describe the investor's capital and therefore led to a bad starting point in the weight space.\n\nFigure 3: The development of a reinvested capital on the training (left) and test set (right). The lines from top to bottom: QL-strategy, MLP-strategy, scaled DAX.\n\nTable 2: Some statistics of the policies.\n\n                        DAX investments            position changes\n               Data     MLP Policy   QL-Policy     MLP Policy   QL-Policy\nTraining set   1825     1005         1020          904          284\nTest set       729      395          434           344          115\n\n6  Conclusions and Future Work\n\nIn this paper, the task of asset allocation/portfolio management was approached by reinforcement learning algorithms. QL was successfully utilized in combination with NNs as value function approximators in a high dimensional state space.
\n\nFuture work has to address the possibility of several alternative investment opportunities and to clarify the connection to the classical mean-variance approach of professional brokers. The benchmark strategy in the real-world experiment is in fact a neuro-fuzzy model which allows the extraction of useful rules after learning. It will be interesting to use that network architecture to approximate the value function in order to achieve a deeper insight into the resulting optimized strategy.\n\nReferences\n\nBarto, A. G., Sutton, R. S. and Watkins, C. J. C. H. (1989), Learning and Sequential Decision Making, COINS TR 89-95.\nBertsekas, D. P. (1987), Dynamic Programming, NY: Wiley.\nSingh, P. S. (1993), Learning to Solve Markovian Decision Processes, CMPSCI TR 93-77.\nNeuneier, R. (1995), Optimal Strategies with Density-Estimating Neural Networks, ICANN 95, Paris.\nBrealy, R. A., Myers, S. C. (1991), Principles of Corporate Finance, McGraw-Hill.\nWatkins, C. J., Dayan, P. (1992), Technical Note: Q-Learning, Machine Learning 8, 3/4.\nElton, E. J., Gruber, M. J. (1971), Dynamic Programming Applications in Finance, The Journal of Finance, 26/2.\nDichtl, H. (1995), Die Prognose des DAX mit Neuro-Fuzzy, master's thesis, English abstract in preparation.\n", "award": [], "sourceid": 1121, "authors": [{"given_name": "Ralph", "family_name": "Neuneier", "institution": null}]}