NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality: The problem studied in this paper ("how to make an autonomous agent learn to gain maximal control of its environment") is not novel; there is much prior work (see [Klyubin 2005] for a good overview). While the algorithm itself is novel, it reads as a somewhat ad hoc combination of ideas: bandits, curiosity, goal-conditioned RL, and planning. Put another way, any sufficiently complex method is likely to be original, but that in no way guarantees it will be useful.

Quality: While I laud the authors for including some baselines in their experiments, I think they were remiss not to include any intrinsic motivation or planning baselines. One simple baseline would be to use SAC to maximize reward + surprise (a rough sketch of what I have in mind is included at the end of this review). Exploration bonuses such as RND [Burda 2018] and ICM [Pathak 2017] are good candidates as well. Given that the tested environments are relatively simple, I expect planning algorithms such as iLQR or PRMs could work well.

Clarity: The method is complex enough that I do not think I would be able to reproduce the experiments without seeing code, and code was not provided. The writing is vague in a number of places and contains many typos (see list below).

Significance: I agree with the authors that we must develop control algorithms that work well when provided only limited supervision. However, my impression is that this method actually requires quite a bit of supervision: first in specifying the goal spaces, again in defining the distance metric for each (L121), again in defining the procedure for sampling from each, and finally in tuning hyperparameters. I suspect that a method with 8 components has a large number of hyperparameters.
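To make the "reward + surprise" baseline suggestion concrete, here is a minimal sketch of what I have in mind, assuming a learned forward model and an ICM/RND-style prediction-error bonus; all names and the coefficient beta are mine and purely illustrative, not from the paper:

import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    # Predicts the next state from the current state and action.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def surprise_augmented_reward(env_reward, state, action, next_state, model, beta=0.1):
    # Extrinsic reward plus a prediction-error ("surprise") bonus. SAC would be
    # trained on this combined reward, while the forward model is regressed on
    # observed transitions.
    with torch.no_grad():
        pred = model(state, action)
        surprise = ((pred - next_state) ** 2).mean(dim=-1)
    return env_reward + beta * surprise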
Reviewer 2
This paper introduces an algorithm that learns to execute tasks that may depend on other tasks being executed first. Each task is a goal-conditioned problem in which the agent needs to bring a specific part (a set of coordinates) of the observed state to a given target. The challenging component is that, to modify some coordinates of the state (like the position of a heavy object), the agent first needs to modify other coordinates (move itself to a tool, then bring the tool to the heavy object). Methods that always set goals in the full state space, like HER or HIRO, therefore do not exploit the underlying structure at all and may struggle considerably with these environments. This is an important problem to solve, and the assumptions are reasonable. The proposed algorithm, in contrast, fully leverages this structure, while only assuming that the coordinate split is given (not the ordering or relationships between tasks).
- First, a "task selector" picks which of the coordinate sets (i.e., which task) to generate a random goal in. This task selector is trained as a bandit with a learning-progress reward.
- The goal is then passed to a subtask planner that decides whether another task should be executed first. To do so, a directed dependency graph over all tasks is built, where the weight on the edge between task i and task j is the success rate of executing task i when j was executed before (in previous attempts). The outgoing edges from node i are normalized, and an epsilon-greedy strategy is used to sample the preceding task (a rough sketch of how I read this is given at the end of my review).
- Once the subtask to execute is decided, a "goal proposal network" outputs the transition states that were previously successful. This network is regressed on previous rollouts, with some batch balancing to upweight the positive examples.
- Finally, the actual goal is passed to a policy that is trained with SAC or HER.
All of the components above also use a surprise signal that seems to be critical for this setup to perform well. For that, a forward model is learned, and when its prediction error exceeds a fixed threshold the event is considered surprising.

I would say that no individual piece is drastically novel, but this is the first time I have seen all of these components working successfully together. The clarity of the method explanation could be improved, though.

Their results are quite impressive, even compared to other hierarchical algorithms like HIRO (albeit none of their baselines leverages all of the assumptions on the task structure as they do). They claim that learning progress is better than prediction error as intrinsic motivation (IM) because "it renders unsolvable tasks uninteresting as soon as progress stalls". This is not convincing: even in the cases they describe where an object cannot be moved, a model-error-driven IM will stop trying to move it once it learns that it cannot be moved. Furthermore, their results suggest that using only the surprise signal, and no other type of IM, is already almost as good. I would argue that the strength of their method lies in the architecture rather than in one IM or another. It might be better to tone down the claims about which IM is better.

Overall, it is a slightly involved architecture, but one that powerfully leverages the given assumptions. The paper needs some serious proof-reading, and the presentation and explanations could also be drastically improved. Still, I think this is a good paper, with good enough experiments, analysis, and ablations.
It would be interesting to push this algorithm to its limit, by increasing the number of tasks available, making the task graph more challenging than a simple chain, adding more stochasticity to the environment, etc.
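As mentioned above, here is a rough sketch of the subtask planner and the surprise check as I understand them; the class and variable names are mine, and the details (initialization, epsilon value, threshold) are assumptions rather than the authors' exact implementation:

import numpy as np

class SubtaskPlanner:
    # Directed dependency graph stored as a success matrix:
    # successes[i, j] counts how often task i succeeded when task j was executed before.
    def __init__(self, n_tasks, eps=0.1):
        self.n_tasks = n_tasks
        self.eps = eps
        self.successes = np.zeros((n_tasks, n_tasks))
        self.attempts = np.ones((n_tasks, n_tasks))  # start at one to avoid division by zero

    def update(self, task, preceding_task, success):
        self.attempts[task, preceding_task] += 1
        self.successes[task, preceding_task] += float(success)

    def sample_preceding_task(self, task, rng=np.random):
        # Normalize the outgoing edges of node `task` and sample the task to
        # execute first, epsilon-greedily.
        rates = self.successes[task] / self.attempts[task]
        if rng.random() < self.eps or rates.sum() == 0.0:
            return rng.randint(self.n_tasks)
        return rng.choice(self.n_tasks, p=rates / rates.sum())

def is_surprising(predicted_next_state, next_state, threshold=1.0):
    # An event counts as surprising when the forward-model prediction error
    # exceeds a fixed threshold.
    return float(np.mean((predicted_next_state - next_state) ** 2)) > threshold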
Reviewer 3
Note: the website links are switched around ("code" links to the video).

This paper proposes a method for intrinsic motivation in hierarchical task environments by combining surprise-based rewards with relational reasoning. The problem addressed is an important one, as RL often fails in long-horizon, multi-step tasks. The idea of "controlling what you can" is an intuitive and appealing way to think about these kinds of environments. CWYC assumes that the state space has been partitioned into different tasks. This assumption is not entirely unreasonable given the power of object detectors and localization.

Method: CWYC has many components that optimize different objectives. A task selector maximizes learning progress (a rough sketch of how I read this selector is given at the end of this review). A task graph learns which tasks depend on which, and selects a path through the graph. Subgoals are generated for each subtask by an attention process over all states, with different parameters per task. Goal-conditioned policies are learned for each task.

Originality: No single component of this method is original, but their combination is. I found it to be an interesting way to use both intrinsic motivation and object relations. The method seems to provide a very good curriculum for learning complex tasks.

Quality: The experiments are well documented and the method outperforms other methods, but it does not compare to other intrinsic motivation methods.

Clarity: Good.

Significance: Overall, I find the paper significant. While the assumption of a partitioned state space is a strong one, it is not unreasonable with tools such as object detectors. The structured output of the goal proposal should be discussed more in the main text.

Potential Issues:
- The task graph is not a function of the particular goal in the final task: this means it may be unable to choose a good sequence of subtasks when the sequence should depend on the particular goal (if you want the heavy object to go left, you need tool 1, but to move it right, you need tool 2).
- The subgoal attention requires attending over all possible goals, which could be a problem in continuous or high-dimensional state spaces.
- The experiments should compare to "Curiosity-driven Exploration by Self-supervised Prediction".
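As noted above, here is a minimal sketch of what I take the learning-progress-driven task selector to be: a bandit whose reward for a task is the absolute change in that task's recent success rate. The window size, epsilon, and treatment of under-explored tasks are my own assumptions, not the authors':

import numpy as np

class LearningProgressBandit:
    def __init__(self, n_tasks, window=50, eps=0.2):
        self.window = window
        self.eps = eps
        self.history = [[] for _ in range(n_tasks)]  # per-task success flags over time

    def update(self, task, success):
        self.history[task].append(float(success))

    def learning_progress(self, task):
        # Absolute change in success rate between the two most recent windows.
        h = self.history[task]
        if len(h) < 2 * self.window:
            return 1.0  # treat under-explored tasks as promising
        recent = np.mean(h[-self.window:])
        older = np.mean(h[-2 * self.window:-self.window])
        return abs(recent - older)

    def select_task(self, rng=np.random):
        # Epsilon-greedy choice of the task with the highest learning progress.
        if rng.random() < self.eps:
            return rng.randint(len(self.history))
        progress = [self.learning_progress(t) for t in range(len(self.history))]
        return int(np.argmax(progress))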