Reviews: Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

The submission considers episodic online learning in MDPs with an unknown transition function and bandit feedback. It provides a no-regret algorithm for the adversarial model, in which the loss function for each episode can be arbitrary. However, the result is limited to the special case of loop-free MDPs. The algorithm builds on a previous algorithm for the setting with an unknown transition function but full-information feedback. The problem is important and challenging, so the loop-free case is a reasonable first step. The author feedback clarified some of the reviewers' concerns, in particular regarding the presentation. After discussion, the reviewers all vote to accept the submission, although their support is not strong in light of the limited contribution and some weaknesses in the presentation.

Paper ID:	1308
Title:	Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function