{"title": "A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation", "book": "Advances in Neural Information Processing Systems", "page_first": 10735, "page_last": 10746, "abstract": "Reinforcement learning is effective in optimizing policies for recommender systems. Current solutions mostly focus on model-free approaches, which require frequent interactions with a real environment, and thus are expensive in model learning. Offline evaluation methods, such as importance sampling, can alleviate such limitations, but usually request a large amount of logged data and do not work well when the action space is large. In this work, we propose a model-based reinforcement learning solution which models the user-agent interaction for offline policy learning via a generative adversarial network. To reduce bias in the learnt policy, we use the discriminator to evaluate the quality of generated sequences and rescale the generated rewards. Our theoretical analysis and empirical evaluations demonstrate the effectiveness of our solution in identifying patterns from given offline data and learning policies based on the offline and generated data.", "full_text": "Model-Based Reinforcement Learning with\n\nAdversarial Training for Online Recommendation\n\nXueying Bai\u2217\u2021, Jian Guan\u2217\u00a7, Hongning Wang\u2020\n\n\u00a7 Department of Computer Science and Technology, Tsinghua University\n\n\u2021Department of Computer Science, Stony Brook University\n\u2020 Department of Computer Science, University of Virginia\n\nxubai@cs.stonybrook.edu, j-guan19@mails.tsinghua.edu.cn\n\nhw5x@virginia.edu\n\nAbstract\n\nReinforcement learning is well suited for optimizing policies of recommender\nsystems. Current solutions mostly focus on model-free approaches, which require\nfrequent interactions with the real environment, and thus are expensive in model\nlearning. 
Offline evaluation methods, such as importance sampling, can alleviate such limitations, but they usually require a large amount of logged data and do not work well when the action space is large. In this work, we propose a model-based reinforcement learning solution which models the user-agent interaction for offline policy learning via a generative adversarial network. To reduce bias in the learned model and policy, we use a discriminator to evaluate the quality of generated data and scale the generated rewards. Our theoretical analysis and empirical evaluations demonstrate the effectiveness of our solution in learning policies from the offline and generated data.\n\n1 Introduction\n\nRecommender systems have been successful in connecting users with their most interested content in a variety of application domains. However, because of users\u2019 diverse interests and behavior patterns, only a small fraction of items are presented to each user, with even less feedback recorded. This leaves relatively little information on user-system interactions for such a large state and action space [2], and thus poses considerable challenges in constructing a useful recommendation policy from historical interactions. It is important to develop solutions that learn users\u2019 preferences from sparse user feedback such as clicks and purchases [11, 13] to further improve the utility of recommender systems.\nUsers\u2019 interests can be short-term or long-term and are reflected by different types of feedback [35]. For example, clicks are generally considered as short-term feedback reflecting users\u2019 immediate interests during the interaction, while purchases reveal users\u2019 long-term interests and usually happen after several clicks. 
Considering both users\u2019 short-term and long-term interests, we frame the recommender system as a reinforcement learning (RL) agent, which aims to maximize users\u2019 overall long-term satisfaction without sacrificing the recommendations\u2019 short-term utility [28].\nClassical model-free RL methods require collecting large quantities of data by interacting with the environment, e.g., a population of users. Therefore, without interacting with real users, a recommender cannot easily probe for reward in previously unexplored regions of the state and action space. However, it is prohibitively expensive for a recommender to interact with users for reward and model updates, because bad recommendations (e.g., for exploration) hurt user satisfaction and increase the risk of users dropping out. In this case, it is preferable for a recommender to learn a policy by fully utilizing the logged data acquired from other policies (e.g., previously deployed systems) instead of direct interactions with users.\n\n\u2217Both authors contributed equally.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFor this purpose, we take a model-based learning approach in this work, in which we estimate a model of user behavior from the offline data and use it to interact with our learning agent to obtain an improved policy simultaneously.\nModel-based RL has a strong advantage of being sample efficient and helping reduce noise in offline data. However, such an advantage can easily diminish due to the inherent bias in its model approximation of the real environment. Moreover, dramatic changes in subsequent policy updates impose the risk of decreased user satisfaction, i.e., inconsistent recommendations across model updates. To address these issues, we introduce adversarial training into a recommender\u2019s policy learning from offline data. 
The discriminator is trained to differentiate simulated interaction trajectories from real ones so as to debias the user behavior model and improve policy learning. To the best of our knowledge, this is the first work to explore adversarial training over a model-based RL framework for recommendation. We theoretically and empirically demonstrate the value of our proposed solution in policy evaluation. Together, the main contributions of our work are as follows:\n\u2022 To avoid the high interaction cost, we propose a unified solution to more effectively utilize the logged offline data with model-based RL algorithms, integrated via adversarial training. It enables robust recommendation policy learning.\n\u2022 The proposed model is verified through theoretical analysis and extensive empirical evaluations. Experiment results demonstrate our solution\u2019s better sample efficiency over the state-of-the-art baselines.2\n\n2 Related Work\nDeep RL for recommendation There have been studies utilizing deep RL solutions in news, music and video recommendations [17, 15, 38]. However, most of the existing solutions are model-free methods and thus do not explicitly model the agent-user interactions. Among these methods, value-based approaches, such as deep Q-learning [20], present unique advantages such as seamless off-policy learning, but are prone to instability with function approximation [30, 19], and the policy\u2019s convergence in these algorithms is not well studied. In contrast, policy-based methods such as policy gradient [14] remain stable but suffer from data bias without real-time interactive control due to learning and infrastructure constraints. Oftentimes, importance sampling [22] is adopted to address the bias, but it instead results in huge variance [2]. 
In this work, we rely on a policy-gradient-based RL approach, in particular REINFORCE [34], but we simultaneously estimate a user behavior model to provide a reliable environment estimate so as to update our agent on-policy.\nModel-based RL Model-based RL algorithms incorporate a model of the environment to predict rewards for unseen state-action pairs. Model-based RL is known in general to outperform model-free solutions in terms of sample complexity [7], and has been applied successfully to control robotic systems both in simulation and in the real world [5, 18, 21, 6]. Furthermore, Dyna-Q [29, 24] integrates model-free and model-based RL to generate samples for learning in addition to the real interaction data. Gu et al. [10] extended these ideas to neural network models, and Peng et al. [24] further applied the method to task-completion dialogue policy learning. However, the most efficient model-based algorithms have used relatively simple function approximations, which struggle in high-dimensional spaces with nonlinear dynamics and thus lead to large approximation bias.\nOffline evaluation The problems of off-policy learning [22, 25, 26] and offline policy evaluation are generally pervasive and challenging in RL, and in recommender systems in particular. As a policy evolves, so does the distribution under which the expectation of the gradient is computed. Especially in the scenario of recommender systems, where item catalogues and user behavior change rapidly, substantial policy changes are required; it is therefore not feasible to take the classic approaches [27, 1] that constrain policy updates before new data is collected under an updated policy. Multiple off-policy estimators leveraging inverse-propensity scores, capped inverse-propensity scores and various variance control measures have been developed [33, 32, 31, 8] for this purpose.\nRL with adversarial training Yu et al. 
[36] propose SeqGAN to extend GANs with an RL-style generator for the sequence generation problem, where the reward signal is provided by the discriminator at the end of each episode via a Monte Carlo sampling approach. The generator takes sequential actions and learns the policy using estimated cumulative rewards. In our solution, the generator consists of two components, i.e., our recommendation agent and the user behavior model, and we model the interactive process via adversarial training and policy gradient. Different from the sequence generation task, which only aims to generate sequences similar to the given observations, we leverage adversarial training to help reduce bias in the user model and further reduce the variance in training our agent. The agent learns from both the interactions with the user behavior model and those stored in the logged offline data. To the best of our knowledge, this is the first work that utilizes adversarial training for improving both model approximation and policy learning on offline data.\n\n2Our implementation is available at https://github.com/JianGuanTHU/IRecGAN.\n\n3 Problem Statement\n\nThe problem is to learn a policy from offline data such that, when deployed online, it maximizes cumulative rewards collected from interactions with users. We address this problem with a model-based reinforcement learning solution, which explicitly models users\u2019 behavior patterns from data.\nProblem A recommender is formed as a learning agent to generate actions under a policy, where each action gives a recommendation list of k items. Every time through interactions between the agent and the environment (i.e., users of the system), a set \u03a9 of n sequences \u03a9 = {\u03c41, ..., \u03c4n} is recorded, where \u03c4i is the i-th sequence containing agent actions, user behaviors and rewards: \u03c4i = {(a^i_0, c^i_0, r^i_0), (a^i_1, c^i_1, r^i_1), ..., (a^i_t, c^i_t, r^i_t)}, where c^i_t is the associated user behavior corresponding to the agent's action a^i_t (e.g., click on a recommended item), and r^i_t represents the reward on a^i_t (e.g., make a purchase).
For simplicity, in the rest of the paper we drop the superscript i to represent a general sequence \u03c4. Based on the observed sequences, a policy \u03c0 is learnt to maximize the expected cumulative reward E_{\u03c4\u223c\u03c0}[\u2211_{t=0}^{T} r_t], where T is the end time of \u03c4.\nAssumption To narrow the scope of our discussion, we study a typical type of user behavior, i.e., clicks, and make the following assumptions: 1) at each time a user must click on one item from the recommendation list; 2) items not clicked in the recommendation list will not influence the user's future behaviors; 3) rewards only relate to clicked items. For example, when taking the user's purchase as reward, purchases can only happen among the clicked items.\nLearning framework In a Markov Decision Process, an environment consists of a state set S, an action set A, a state transition distribution P : S \u00d7 A \u00d7 S, and a reward function fr : S \u00d7 A \u2192 R, which maps a state-action pair to a real-valued scalar. In this paper, the environment is modeled as a user behavior model U and learnt from offline log data. S is reflected by the interaction history before time t, and P captures the transition of user behaviors. In the meanwhile, based on the assumptions mentioned above, at each time t the environment generates the user's click ct on items recommended by an agent A in at based on his/her click probabilities under the current state, and the reward function fr generates reward rt for the clicked item ct.\nOur recommendation policy is learnt from both offline data and data sampled from the learnt user behavior model, i.e., a model-based RL solution. 
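As a minimal sketch of the data layout just described (the class and field names are ours, purely illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    action: List[int]  # a_t: the k recommended item ids
    click: int         # c_t: the item the user clicked
    reward: float      # r_t: e.g. 1.0 if the clicked item was purchased

# one logged sequence tau = {(a_0, c_0, r_0), ..., (a_T, c_T, r_T)}
tau = [
    Step(action=[3, 7, 1], click=7, reward=0.0),
    Step(action=[2, 9, 5], click=2, reward=1.0),
]

# the learning objective maximizes the expected cumulative reward sum_t r_t
cumulative_reward = sum(step.reward for step in tau)
```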
We incorporate adversarial training in our model-based policy learning to: 1) improve the user model to ensure the sampled data is close to the true data distribution; and 2) utilize the discriminator to scale rewards from generated sequences to further reduce bias in value estimation. Our proposed solution contains an interactive model constructed by U and A, and an adversarial policy learning approach. We name the solution Interactive Recommender GAN, or IRecGAN in short. The overview of our proposed solution is shown in Figure 1.\n\n4 Interactive Modeling for Recommendation\n\nWe present our interactive model for recommendation, which consists of two components: 1) the user behavior model U that generates user clicks over the recommended items with corresponding rewards; and 2) the agent A which generates recommendations according to its policy. U and A interact with each other to generate user behavior sequences for adversarial policy learning.\nUser behavior model Given users' click observations {c0, c1, ..., c_{t-1}}, the user behavior model U first projects each clicked item into an embedding vector e^u at each time.3 The state s^u_t can be represented as a summary of the click history, i.e., s^u_t = h_u(e^u_0, e^u_1, ..., e^u_{t-1}). We use a recurrent neural network to model the state transition P on the user side; thus for the state s^u_t we have\n\ns^u_t = h_u(s^u_{t-1}, e^u_{t-1}),\n\n3As we can use different embeddings on the user side and agent side, we use the superscripts u and a to denote this difference accordingly.\n\nFigure 1: Model overview of IRecGAN. A, U and D denote the agent model, user behavior model, and discriminator, respectively. 
In IRecGAN, A and U interact with each other to generate recommendation sequences that are close to the true data distribution, so as to jointly reduce bias in U and improve the recommendation quality in A.\n\nh_u(\u00b7,\u00b7) can be functions in the RNN family like GRU [4] and LSTM [12] cells. Given the action at = {at(1), ..., at(k)}, i.e., the top-k recommendations at time t, we compute the probability of a click among the recommended items via a softmax function,\n\np(ct = at(i)|s^u_t, at) = exp(V^i_c) / \u2211_{j=1}^{|at|} exp(V^j_c), with V_c = (W_c s^u_t + b_c)\u22a4 E^u_t, (1)\n\nwhere V_c \u2208 R^k is a transformed vector indicating the evaluated quality of each recommended item at(i) under state s^u_t, E^u_t is the embedding matrix of the recommended items, W_c is the click weight matrix, and b_c is the corresponding bias term. Under the assumption that target rewards only relate to clicked items, the reward rt for (s^u_t, at) is calculated by:\n\nr_t(s^u_t, at) = f_r((W_r s^u_t + b_r)\u22a4 e^u_{ct}), (2)\n\nwhere W_r is the reward weight matrix, b_r is the corresponding bias term, and f_r is the reward mapping function, which can be set according to the reward definition in specific recommender systems. For example, if we make rt the purchase of a clicked item ct, where rt = 1 if it is purchased and rt = 0 otherwise, f_r can be realized by a Sigmoid function with binary output.\nBased on Eq (1) and (2), taking categorical reward, the user behavior model U can be estimated from the offline data \u03a9 via maximum likelihood estimation:\n\nmax \u2211_{\u03c4i\u2208\u03a9} \u2211_{t=0}^{Ti} log p(c^i_t|s^{u,i}_t, a^i_t) + \u03bb_p log p(r^i_t|s^{u,i}_t, c^i_t), (3)\n\nwhere \u03bb_p is a parameter balancing the loss between click prediction and reward prediction, and Ti is the length of the observation sequence \u03c4i. 
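A minimal numpy sketch of the user behavior model U described above (a plain tanh RNN cell stands in for the GRU/LSTM cells; all dimensions and weight initializations are illustrative placeholders, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, k = 8, 20, 5                 # state dim, catalogue size, slate size
E = rng.normal(size=(n_items, d))        # item embeddings e^u
Wh = rng.normal(size=(d, 2 * d)) * 0.1   # RNN weights (tanh cell as stand-in)
Wc, bc = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
Wr, br = rng.normal(size=(d, d)) * 0.1, np.zeros(d)

def step_state(s, e):
    # s^u_t = h_u(s^u_{t-1}, e^u_{t-1})
    return np.tanh(Wh @ np.concatenate([s, e]))

def click_probs(s, slate):
    # Eq (1): V_c = (W_c s + b_c)^T E^u_t, softmax over the k recommended items
    V = E[slate] @ (Wc @ s + bc)
    V = np.exp(V - V.max())
    return V / V.sum()

def reward(s, c):
    # Eq (2): f_r((W_r s + b_r)^T e_c), with a sigmoid f_r for a binary purchase reward
    return 1.0 / (1.0 + np.exp(-(Wr @ s + br) @ E[c]))

s = np.zeros(d)                      # initial user state
slate = [0, 3, 9, 12, 17]            # a_t: recommended item ids
p = click_probs(s, slate)
c = slate[int(rng.choice(k, p=p))]   # sample the user's click c_t
r = reward(s, c)                     # reward prediction for the clicked item
s = step_state(s, E[c])              # state transition with the clicked item's embedding
```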
With a learnt user behavior model, user clicks and rewards on the recommendation list can be sampled from Eq (1) and (2) accordingly.\nAgent The agent should take actions based on the environment's provided states. However, in practice, users' states are not observable in a recommender system. Besides, as discussed in [23], the states for the agent to take actions may be different from those for users to generate clicks and rewards. As a result, we build a different state model on the agent side in A to learn its states. Similar to that on the user side, given the projected click vectors {e^a_0, e^a_1, ..., e^a_{t-1}}, we model states on the agent side by s^a_t = h_a(s^a_{t-1}, e^a_{t-1}), where s^a_t denotes the state maintained by the agent at time t and h_a(\u00b7,\u00b7) is the chosen RNN cell. The initial state s^a_0 for the first recommendation is drawn from a distribution \u03c1. We simply denote it as s0 in the rest of our paper. We should note that although the agent also models states based on users' click history, it might create different state sequences than those on the user side.\nBased on the current state s^a_t, the agent generates a size-k recommendation list out of the entire set of items as its action at. The probability of item i to be included in at under the policy \u03c0 is:
\n\u03c0(i \u2208 at|s^a_t) = exp(W^i_a s^a_t + b^i_a) / \u2211_{j=1}^{|C|} exp(W^j_a s^a_t + b^j_a), (4)\n\nwhere W^i_a is the i-th row of the action weight matrix W_a, C is the entire set of recommendation candidates, and b^i_a is the corresponding bias term. Following [2], we generate at by sampling without replacement according to Eq (4). Unlike [3], we do not consider the combinatorial effect among the k items, by simply assuming the users will evaluate them independently (as indicated in Eq (1)).\n\n[Figure 1, panels: (a) Interactive modeling and adversarial policy learning with a discriminator; (b) Training of the discriminator.]\n\n5 Adversarial Policy Learning\nWe use the policy gradient method REINFORCE [34] for the agent's policy learning, based on both the generated and offline data. When generating \u02c6\u03c4_{0:t} = {(\u02c6a0, \u02c6c0, \u02c6r0), ..., (\u02c6at, \u02c6ct, \u02c6rt)} for t > 0, we obtain \u02c6at = A(\u02c6\u03c4^c_{0:t-1}) by Eq (4), \u02c6ct = U_c(\u02c6\u03c4^c_{0:t-1}, \u02c6at) by Eq (1), and \u02c6rt = U_r(\u02c6\u03c4^c_{0:t-1}, \u02c6ct) by Eq (2). \u03c4^c represents the clicks in the sequence \u03c4, and (\u02c6a0, \u02c6c0, \u02c6r0) is generated from s^a_0 and s^u_0 accordingly. 
The generation of a sequence ends at time t if \u02c6ct = c_end, where c_end is a stopping symbol. The distributions of generated and offline data are denoted as g and data respectively. In the following discussions, we do not explicitly differentiate \u03c4 and \u02c6\u03c4 when their distribution is specified.\nSince we start the training of U from offline data, it introduces inherent bias from the observations and our specific modeling choices. The bias affects the sequence generation and thus may cause biased value estimation. To reduce the effect of bias, we apply adversarial training to control the training of both U and A. The discriminator is also used to rescale the generated rewards \u02c6r for policy learning. Therefore, the learning of agent A considers both sequence generation and target rewards.\n\n5.1 Adversarial training\n\nWe leverage adversarial training to encourage our IRecGAN model to generate high-quality sequences that capture intrinsic patterns in the real data distribution. A discriminator D is used to evaluate a given sequence \u03c4, where D(\u03c4) represents the probability that \u03c4 is generated from the real recommendation environment. The discriminator can be estimated by minimizing the objective function:\n\n\u2212E_{\u03c4\u223cdata} log D(\u03c4) \u2212 E_{\u03c4\u223cg} log(1 \u2212 D(\u03c4)). (5)\n\nHowever, D only evaluates a completed sequence, and hence it cannot directly evaluate a partially generated sequence at a particular time step t. Inspired by [36], we utilize the Monte-Carlo tree search algorithm with the roll-out policy constructed by U and A to get a sequence generation score at each time. 
At time t, the sequence generation score qD of \u03c4_{0:t} is defined as:\n\nqD(\u03c4_{0:t}) = (1/N) \u2211_{n=1}^{N} D(\u03c4^n_{0:T}), \u03c4^n_{0:T} \u2208 MC^{U,A}(\u03c4_{0:t}; N), for t < T; qD(\u03c4_{0:T}) = D(\u03c4_{0:T}) for t = T, (6)\n\nwhere MC^{U,A}(\u03c4_{0:t}; N) is the set of N sequences sampled from the interaction between U and A. Given the observations in offline data, U should generate clicks and rewards that reflect intrinsic patterns of the real data distribution. Therefore, U should maximize the sequence generation objective E_{s^u_0\u223c\u03c1}[\u2211_{(a0,c0,r0)\u223cg} U(c0, r0|s^u_0, a0) \u00b7 qD(\u03c4_{0:0})], which is the expected discriminator score for generating a sequence from the initial state. U may not generate clicks and rewards exactly the same as those in offline data, but the similarity of its generated data to offline data is still an informative signal to evaluate its sequence generation quality. By setting qD(\u03c4_{0:t}) = 1 at any time t for offline data, we extend this objective to include offline data (it becomes the data likelihood function on offline data). Following [36], based on Eq (1) and Eq (2), the gradient of U's objective can be derived as,\n\nE_{\u03c4\u223c{g,data}}[\u2211_{t=0}^{T} qD(\u03c4_{0:t}) \u2207_{\u0398u}(log p_{\u0398u}(ct|s^u_t, at) + \u03bb_p log p_{\u0398u}(rt|s^u_t, ct))], (7)\n\nwhere \u0398u denotes the parameters of U and \u0398a denotes those of A. Based on our assumption, even when U can already capture users' true behavior patterns, it still depends on A to provide appropriate recommendations to generate clicks and rewards that the discriminator will treat as authentic. Hence, A and U are coupled in this adversarial training. To encourage A to provide the needed recommendations, we include qD(\u03c4_{0:t}) as a sequence generation reward for A at time t as well. As qD(\u03c4_{0:t}) evaluates the overall generation quality of \u03c4_{0:t}, it ignores sequence generations after t.
To evaluate the quality of a whole sequence, we require A to maximize the cumulative sequence generation reward E_{\u03c4\u223c{g,data}}[\u2211_{t=0}^{T} qD(\u03c4_{0:t})]. Because A does not directly generate the observations in the interaction sequence, we approximate \u2207_{\u0398a}(\u2211_{t=0}^{T} qD(\u03c4_{0:t})) as 0 when calculating the gradients. Putting these together, the gradient derived from sequence generations for A is estimated as,\n\nE_{\u03c4\u223c{g,data}}[\u2211_{t=0}^{T} (\u2211_{t'=t}^{T} \u03b3^{t'-t} qD(\u03c4_{0:t'})) \u2207_{\u0398a} log \u03c0_{\u0398a}(ct \u2208 at|s^a_t)]. (8)\n\nBased on our assumption that only the clicked items influence user behaviors, and that U only generates rewards on the clicked items, we use \u03c0_{\u0398a}(ct \u2208 at|s^a_t) as an estimation of \u03c0_{\u0398a}(at|s^a_t), i.e., A should promote ct in its recommendation at time t. In practice, we add a discount factor \u03b3 < 1 when calculating the cumulative rewards to reduce estimation variance [2].\n\n5.2 Policy learning\n\nBecause our adversarial training encourages IRecGAN to generate clicks and rewards with similar patterns as offline data, and we assume rewards only relate to the clicked items, we use offline data as well as generated data for policy learning. Given data \u03c4_{0:T} = {(a0, c0, r0), ..., (aT, cT, rT)}, including both offline and generated data, the objective of the agent is to maximize the expected cumulative reward E_{\u03c4\u223c{g,data}}[R_T], where R_T = \u2211_{t=0}^{T} rt. In the generated data, due to the difference in distributions of the generated and offline sequences, the generated reward \u02c6rt calculated by Eq (2) might be biased. 
To reduce such bias, we utilize the sequence generation score in Eq (6) to rescale the generated rewards: r^s_t = qD(\u03c4_{0:t})\u02c6rt, and treat it as the reward for generated data. The gradient of the objective is thus estimated by:\n\nE_{\u03c4\u223c{g,data}}[\u2211_{t=0}^{T} R_t \u2207_{\u0398a} log \u03c0_{\u0398a}(ct \u2208 at|s^a_t)], with R_t = \u2211_{t'=t}^{T} \u03b3^{t'-t} qD(\u03c4_{0:t'}) r_{t'}. (9)\n\nR_t is an approximation of R_T with the discount factor \u03b3. Overall, the user behavior model U is updated only by the sequence generation objective defined in Eq (7) on both offline and generated data, but the agent A is updated by both sequence generation and target rewards. Hence, the overall reward for A at time t is qD(\u03c4_{0:t})(1 + \u03bb_r rt), where \u03bb_r is the weight for cumulative target rewards. The overall gradient for A is thus:\n\nE_{\u03c4\u223c{g,data}}[\u2211_{t=0}^{T} R^a_t \u2207_{\u0398a} log \u03c0_{\u0398a}(ct \u2208 at|s^a_t)], with R^a_t = \u2211_{t'=t}^{T} \u03b3^{t'-t} qD(\u03c4_{0:t'})(1 + \u03bb_r r_{t'}). (10)\n\n6 Theoretical Analysis\nFor one iteration of policy learning in IRecGAN, we first train the discriminator D with offline data, which follows P_data and was generated by an unknown logging policy, and with the data generated by IRecGAN under \u03c0_{\u0398a}, whose distribution is denoted g. When \u0398u and \u0398a are learnt, for a given sequence \u03c4, by Proposition 1 in [9] the optimal discriminator is D*(\u03c4) = P_data(\u03c4) / (P_data(\u03c4) + P_g(\u03c4)).\nSequence generation Both A and U contribute to the sequence generation in IRecGAN. U is updated by the gradient in Eq (7) to maximize the sequence generation objective. 
At time t, the expected sequence generation reward for A on the generated data is: E_{\u03c4_{0:t}\u223cg}[qD(\u03c4_{0:t})] = E_{\u03c4_{0:t}\u223cg}[D(\u03c4_{0:T}|\u03c4_{0:t})]. The expected value of \u03c4_{0:t} is: E_{\u03c4\u223cg}[V_g] = E_{\u03c4\u223cg}[\u2211_{t=0}^{T} qD(\u03c4_{0:t})] = \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cg}[D(\u03c4_{0:T}|\u03c4_{0:t})]. Given the optimal D*, the sequence generation value can be written as:\n\nE_{\u03c4\u223cg}[V_g] = \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cg}[P_data(\u03c4_{0:T}|\u03c4_{0:t}) / (P_data(\u03c4_{0:T}|\u03c4_{0:t}) + P_g(\u03c4_{0:T}|\u03c4_{0:t}))]. (11)\n\nMaximizing each term in the summation of Eq (11) is an objective for the generator at time t in GAN. According to [9], the optimal solution for all such terms is P_g(\u03c4_{0:T}|s0) = P_data(\u03c4_{0:T}|s0). It means A can maximize the sequence generation value when it helps to generate sequences with the same distribution as data. Besides the global optimum, Eq (11) also encourages A to reward each P_g(\u03c4_{0:T}|\u03c4_{0:t}) = P_data(\u03c4_{0:T}|\u03c4_{0:t}), even if \u03c4_{0:t} is less likely to be generated from P_g. This prevents IRecGAN from recommending items that only cater to users' immediate preferences.\nValue estimation The agent A should also be updated to maximize the expected value of target rewards V_a. To achieve this, we use the discriminator D to rescale the estimation of V_a on the generated sequences, and we also combine offline data to evaluate V_a for policy \u03c0_{\u0398a}:\n\nE_{\u03c4\u223c\u03c0_{\u0398a}}[V_a] = \u03bb_1 \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cg}[P_data(\u03c4_{0:t}) / (P_data(\u03c4_{0:t}) + P_g(\u03c4_{0:t})) \u00b7 \u02c6rt] + \u03bb_2 \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cdata}[rt], (12)\n\nwhere \u02c6rt is the reward generated by U at time t and rt is the true reward. \u03bb_1 and \u03bb_2 represent the ratio of generated data and offline data during model training, and we require \u03bb_1 + \u03bb_2 = 1. Here we simplify P(\u03c4_{0:T}|\u03c4_{0:t}) as P(\u03c4_{0:t}). As a result, there are three sources of bias in this value estimation:\n\n\u0394 = \u02c6rt \u2212 rt, \u03b4_1 = 1 \u2212 P_{\u03c0_{\u0398a}}(\u03c4_{0:t})/P_g(\u03c4_{0:t}), \u03b4_2 = 1 \u2212 P_{\u03c0_{\u0398a}}(\u03c4_{0:t})/P_data(\u03c4_{0:t}).
Based on these sources of bias, the expected value estimation in Eq (12) is:\n\nE_{\u03c4\u223c\u03c0_{\u0398a}}[V_a] = \u03bb_1 \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cg}[(P_{\u03c0_{\u0398a}}(\u03c4_{0:t})/P_g(\u03c4_{0:t})) \u00b7 (\u0394 + rt)/(2 \u2212 (\u03b4_1 + \u03b4_2))] + \u03bb_2 \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cdata}[(P_{\u03c0_{\u0398a}}(\u03c4_{0:t})/P_data(\u03c4_{0:t}) + \u03b4_2) rt]\n= V^{\u03c0_{\u0398a}}_a + \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223c\u03c0_{\u0398a}}[w_t \u0394] + \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223cdata}[\u03bb_2 \u03b4_2 rt] \u2212 \u2211_{t=0}^{T} E_{\u03c4_{0:t}\u223c\u03c0_{\u0398a}}[(\u03bb_1 \u2212 w_t) rt],\n\nwhere w_t = \u03bb_1 / (2 \u2212 (\u03b4_1 + \u03b4_2)). \u0394 and \u03b4_1 come from the bias of the user behavior model U. Because the adversarial training helps improve U to capture real data patterns, it decreases \u0394 and \u03b4_1. Because we can adjust the sampling ratio \u03bb_1 to reduce w_t, w_t\u0394 can be small. The sequence generation rewards for agent A encourage the distribution g to be close to data. Because \u03b4_2 = 1 \u2212 (P_{\u03c0_{\u0398a}}(\u03c4_{0:t})/P_g(\u03c4_{0:t})) \u00b7 (P_g(\u03c4_{0:t})/P_data(\u03c4_{0:t})), the bias \u03b4_2 can also be reduced. This shows that our method has a bias-controlling effect.\n\n7 Experiments\n\nOur theoretical analysis shows that reducing the model bias improves value estimation, and therefore improves policy learning. 
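As a concrete illustration of the reward shaping analyzed above, the discounted, discriminator-rescaled return R^a_t of Eq (10) can be computed as follows (a numpy sketch; the q_D scores and rewards are made-up numbers, while gamma = 0.9 and lambda_r = 3 match our experiment settings):

```python
import numpy as np

def agent_returns(q, r, gamma=0.9, lambda_r=3.0):
    # Eq (10): R^a_t = sum_{t'=t..T} gamma^(t'-t) * q_D(tau_{0:t'}) * (1 + lambda_r * r_{t'})
    T = len(r)
    R = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):  # backward recursion: acc = q_t(1 + lambda_r r_t) + gamma * acc
        acc = q[t] * (1.0 + lambda_r * r[t]) + gamma * acc
        R[t] = acc
    return R

q = [0.8, 0.6, 0.9]   # hypothetical q_D scores from the discriminator roll-outs
r = [0.0, 1.0, 0.0]   # hypothetical target rewards (e.g. purchases)
R = agent_returns(q, r)
```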
In this section, we conduct empirical evaluations on both real-world and synthetic datasets to demonstrate that our solution can effectively model the patterns of data for better recommendations, compared with state-of-the-art solutions.\n\n7.1 Simulated Online Test\n\nGiven the difficulty of deploying a recommender system with real users for online evaluation, we use simulation-based studies to first investigate the effectiveness of our approach, following [37, 3].\nSimulated Environment We synthesize an MDP to simulate an online recommendation environment. It has m states and n items for recommendation, with a randomly initialized transition probability matrix P(s \u2208 S|aj \u2208 A, si \u2208 S). Under each state si, an item aj's reward r(aj \u2208 A|si \u2208 S) is uniformly sampled from the range of 0 to 1. During the interaction, given a recommendation list of k items selected from the whole item set by an agent, the simulator first samples an item, proportional to its ground-truth reward under the current state si, as the click candidate. Denoting the sampled item as aj, a Bernoulli experiment is performed on this item with r(aj) as the success probability; the simulator then moves to the next state according to the state transition probability p(s|aj, si). A special state s0 is used to initialize all sessions, which do not stop until the Bernoulli experiment fails. The immediate reward is 1 if the session continues to the next step, and 0 otherwise. In our experiment, m, n and k are set to 10, 50 and 10 respectively.\nOffline Data Generation We generate offline recommendation logs, denoted by doff, with the simulator. The bias and variance in doff are controlled by changing the logging policy and the size of doff. We adopt three different logging policies: 1) a uniformly random policy \u03c0_random, 2) a maximum reward policy \u03c0_max, 3) a mixed reward policy \u03c0_mix. 
Specifically, πmax recommends the top k items with the highest ground-truth reward under the current simulator state at each step, while πmix randomly selects k items with either the top 20%-50% ground-truth reward or the highest ground-truth reward under a given state. Meanwhile, we vary the size of doff from 200 to 10,000 sequences.
Baselines We compare IRecGAN with the following baselines: 1) LSTM: only the user behavior model, trained on offline data; 2) PG: only the agent model, trained by policy gradient on offline data; 3) LSTMD: the user behavior model in IRecGAN, updated by adversarial training.
Experiment Settings The hyper-parameters in all models are set as follows: the item embedding dimension is set to 50, the discount factor γ in value calculation to 0.9, and the scale factors λr and λp to 3 and 1 respectively. We use 2-layer LSTM units with 512-dimensional hidden states. The ratio of generated training samples to offline data in each training epoch is set to 1:10. We use an RNN-based discriminator in all experiments, with details provided in the appendix.
Online Evaluation After training our models and baselines on doff, we deploy the learned policies to interact with the simulator for online evaluation. We calculate coverage@r to measure the proportion of the true top r relevant items that are ranked in the top k recommended items by a model across all time steps (details in the appendix). The results of coverage@r under different configurations of offline data generation are reported in Figure 2. 
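For reference, the synthetic MDP from the Simulated Environment paragraph can be sketched as follows. This is a minimal reimplementation under our own names (RecSimulator, step), not the authors' released code, with m = 10, n = 50, k = 10 as in the experiment settings:

```python
import numpy as np

class RecSimulator:
    """Minimal sketch of the synthetic recommendation MDP (names are ours)."""
    def __init__(self, m=10, n=50, seed=0):
        self.rng = np.random.default_rng(seed)
        P = self.rng.random((m, n, m))                 # transition P(s' | a, s)
        self.P = P / P.sum(axis=2, keepdims=True)      # normalize over s'
        self.R = self.rng.random((m, n))               # reward r(a | s) ~ U(0, 1)
        self.state = 0                                 # special initial state s0

    def step(self, rec_list):
        """Show k items; return (clicked_item, immediate_reward, session_over)."""
        rewards = self.R[self.state, rec_list]
        a = self.rng.choice(rec_list, p=rewards / rewards.sum())  # click candidate
        if self.rng.random() < self.R[self.state, a]:  # Bernoulli success: continue
            self.state = self.rng.choice(self.P.shape[2], p=self.P[self.state, a])
            return a, 1.0, False
        return a, 0.0, True                            # failure ends the session

# One session under a uniformly random logging policy (pi_random):
sim = RecSimulator()
total, over = 0.0, False
while not over:
    shown = sim.rng.choice(50, size=10, replace=False)
    _, reward, over = sim.step(shown)
    total += reward
```

A logged session in doff would record the shown lists, clicks, and rewards of such rollouts; πmax and πmix would replace the random `shown` list with reward-ranked selections.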
Under \u03c0random, coverage@r of all algorithms are\nrelatively low when r is large and the difference in overall performance between behavior and agent\n\n7\n\n\f(a) \u03c0random(200)\n\n(b) \u03c0random(10, 000)\n\n(c) \u03c0max(10, 000)\n\n(d) \u03c0mix(10, 000)\n\nFigure 2: Online evaluation results of coverage@r and cumulative rewards.\n\nFigure 3: Online learning results of coverage@1 and coverage@10.\n\nmodels is not very large. This suggests the dif\ufb01culty of recognizing high reward items under \u03c0random,\nbecause every item has an equal chance to be observed (i.e., full exploration) especially with a small\nsize of of\ufb02ine data. However, under \u03c0max and \u03c0mix, when the high reward items can be suf\ufb01ciently\nlearned, user behavior models (LSTM, LSTMD) fail to capture the overall preferred items while agent\nmodels (PG, IRecGAN) are stable to the change of r. IRecGAN shows its advantage especially under\n\u03c0mix, which requires a model to differentiate top relevant items from those with moderate reward. It\nhas close coverage@r to LSTM when r is small and better captures users\u2019 overall preferences when\nuser behavior models fail seriously. When rewards can not be suf\ufb01ciently learned (Fig 2(a)), our\nmechanism can strengthen the in\ufb02uence of truly learned rewards (LSTMD outperforms LSTM when\nr is small) but may also underestimate some bias. However, when it is feasible to estimate the reward\ngeneration (Fig 2(b)(c)(d)), both LSTMD and IRecGAN outperform baselines in coverage@r under\nthe help of generating samples via adversarial training.\nThe average cumulative rewards are also reported in the rightmost bars of Figure 2. They are calculated\nby generating 1000 sequences with the environment and take the average of their cumulative rewards.\nIRecGAN has a larger average cumulative reward than other methods under all con\ufb01gurations except\n\u03c0random with 10,000 of\ufb02ine sequences. 
Under \u03c0random(10, 000) IRecGAN outperforms PG but not\nLSTMD. The low cumulative reward of PG under \u03c0random indicates that the transition probabilities\nconditioned on high rewarded items may not be suf\ufb01ciently learned under the random of\ufb02ine policy.\nOnline Learning To evaluate our model\u2019s effectiveness in a more practical setting, we execute online\nand of\ufb02ine learning alternately. Speci\ufb01cally, we separate the learning into two stages: \ufb01rst, the\nagents can directly interact with the simulator to update their policies, and we only allow them to\ngenerate 200 sequences in this stage; then they turn to the of\ufb02ine stage to reuse their generated data\nfor of\ufb02ine learning. We iterate the two stages and record their performance in the online learning\nstage. We compare with the following baselines: 1) PG-online with only online learning, 2) PG-\nonline&of\ufb02ine with online learning and reusing the generated data via policy gradient for of\ufb02ine\n\n8\n\n12345678910rewardr0.30.40.50.6coverage@rLSTMLSTMDPGIRecGAN2.93.03.13.23.33.43.53.6reward12345678910rewardr0.400.450.500.550.600.65coverage@rLSTMLSTMDPGIRecGAN3.03.23.43.63.84.04.2reward12345678910rewardr0.30.40.50.60.7coverage@rLSTMLSTMDPGIRecGAN3.03.23.43.63.84.04.2reward12345678910rewardr0.30.40.50.6coverage@rLSTMLSTMDPGIRecGAN2.93.03.13.23.33.43.53.6reward05101520iteration0.150.200.250.300.350.40coverage@1LSTM-offlinePG-onlinePG-online&offlineIRecGAN-online&offline05101520iteration0.150.200.250.300.35coverage@10LSTM-offlinePG-onlinePG-online&offlineIRecGAN-online&offline\flearning, and 3) LSTM-of\ufb02ine with only of\ufb02ine learning. We train all the models from scratch and\nreport the performance of coverage@1 and coverage@10 over 20 iterations in Figure 3. We can\nobserve that LSTM-of\ufb02ine performs worse than other RL methods with of\ufb02ine learning, especially\nin the later stage, due to its lack of exploration. 
PG-online improves slowly as it does not reuse the generated data. Compared with PG-online&offline, IRecGAN has better convergence and coverage because of its reduced value estimation bias. We also find that coverage@10 is harder to improve. The key reason is that as the model identifies the items with high rewards, it tends to recommend them more often. This gives less relevant items less chance to be explored, similar to our online evaluation experiments under πmax and πmix. Our model-based RL training alleviates this bias to a certain extent by generating more training sequences, but cannot remove it entirely. This motivates us to focus on the explore-exploit trade-off in model-based RL in our future work.

7.2 Real-world Data Offline Test
We use a large-scale real-world recommendation dataset from CIKM Cup 2016 to evaluate the effectiveness of our proposed solution for offline reranking. Sessions of length 1 or longer than 40, and items that have never been clicked, are filtered out. We selected the top 40,000 most popular items as the recommendation candidate set, and randomly selected 65,284/1,718/1,720 sessions for training/validation/testing. The average session length is 2.81/2.80/2.77 respectively, and the ratio of clicks leading to purchases is 2.31%/2.46%/2.45%. We followed the same model settings as in our simulation-based study. To understand the effect of different data separation strategies on RL model training and testing, we also provide a comparison of performance under different data separation strategies in the appendix.
Baselines In addition to the baselines compared in our simulation-based study, we also include the following state-of-the-art solutions for recommendation: 1). PGIS: the agent model estimated with importance sampling on offline data to reduce bias; 2). 
AC: an LSTM model with the same setting as our agent model but trained with the actor-critic algorithm [16] to reduce variance; 3). PGU: the agent model trained on both offline and generated data, without adversarial training; 4). ACU: the AC model trained on both offline and generated data, without adversarial training.
Evaluation Metrics All models were applied to rerank the given recommendation list at each step of the testing sessions in the offline data. We used Precision@k (P@1 and P@10) to compare the models' recommendation performance, where clicked items are defined as relevant. Because the logged recommendation lists were not ordered, we cannot assess the logging policy's performance here.

Table 1: Rerank evaluation on real-world dataset with random splitting.

Model     LSTM        LSTMD       PGIS        PG          AC          PGU         ACU         IRecGAN
P@10 (%)  32.43±0.22  33.42±0.40  28.13±0.45  33.28±0.71  31.93±0.17  34.12±0.52  32.89±0.50  35.06±0.48
P@1 (%)    6.63±0.29   8.55±0.63   4.61±0.73   6.25±0.14   6.54±0.19   6.44±0.56   8.20±0.65   6.79±0.44

Results The results of the offline rerank evaluation are reported in Table 1. With the help of adversarial training, IRecGAN achieved an encouraging P@10 improvement over all baselines. This verifies the effectiveness of our model-based reinforcement learning, and especially its adversarial training strategy for utilizing the offline data with reduced bias. Specifically, PGIS did not perform as well as PG, partially because of the high variance introduced by importance sampling. PGU was able to fit the given data more accurately than PG by learning from the generated data, since there are many items for recommendation and the collected data is limited. However, PGU performed worse than IRecGAN because of the biased user behavior model. 
With the help of the discriminator, IRecGAN reduces the bias in the user behavior model to improve value estimation and policy learning. This is also reflected in its improved user behavior model: LSTMD outperformed LSTM, with both serving as user behavior models.

8 Conclusion
In this work, we developed a practical model-based reinforcement learning solution that utilizes offline data for recommendation. We introduced adversarial training for joint user behavior model learning and policy updates. Our theoretical analysis shows our solution's promise in reducing bias; our empirical evaluations on both synthetic and real-world recommendation datasets verify its effectiveness. Several directions remain open, including balancing the explore-exploit trade-off in policy learning with offline data, incorporating richer structures in user behavior modeling, and exploring the applicability of our solution in other off-policy learning scenarios, such as conversational systems.

References
[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 22–31. JMLR.org, 2017.

[2] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H. Chi. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 456–464. ACM, 2019.

[3] Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song. Generative adversarial user model for reinforcement learning based recommendation system. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 1052–1061, 2019.

[4] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[5] Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.

[6] Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. Learning to control a low-cost manipulator using data-efficient reinforcement learning. 2011.

[7] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.

[8] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206. ACM, 2018.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.

[11] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 549–558. ACM, 2016.

[12] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[13] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. 
Computer, 42(8):30–37, 2009.

[14] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[15] Elad Liebman, Maytal Saar-Tsechansky, and Peter Stone. DJ-MC: A reinforcement-learning agent for music playlist recommendation. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 591–599. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[16] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[17] Zhongqi Lu and Qiang Yang. Partially observable Markov decision process for recommender systems. arXiv preprint arXiv:1608.07793, 2016.

[18] David Meger, Juan Camilo Gamboa Higuera, Anqi Xu, Philippe Giguere, and Gregory Dudek. Learning legged swimming gaits from experience. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2332–2338. IEEE, 2015.

[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[21] Jun Morimoto and Christopher G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in Neural Information Processing Systems, pages 1563–1570, 2003.

[22] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. 
In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.

[23] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128, 2017.

[24] Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. Deep Dyna-Q: Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176, 2018.

[25] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.

[26] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pages 417–424, 2001.

[27] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[28] Guy Shani, David Heckerman, and Ronen I. Brafman. An MDP-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.

[29] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.

[30] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[31] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16(1):1731–1755, 2015.

[32] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. 
In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

[33] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

[34] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[35] Qingyun Wu, Hongning Wang, Liangjie Hong, and Yue Shi. Returning is believing: Optimizing long-term user engagement in recommender systems. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1927–1936. ACM, 2017.

[36] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[37] Xiangyu Zhao, Long Xia, Yihong Zhao, Dawei Yin, and Jiliang Tang. Model-based reinforcement learning for whole-chain recommendations. arXiv preprint arXiv:1902.03987, 2019.

[38] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, pages 167–176. International World Wide Web Conferences Steering Committee, 2018.