NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Reviewer 1
This paper considers a linear stochastic multi-armed bandit (MAB) problem with safety constraints that are linear in the unknown parameter vector $\mu$. The motivation for and importance of such a problem come from bandit applications to safety-critical systems, where respecting the system constraints is critical, yet the constraints themselves depend on the bandit's unknown parameters. The authors clearly introduce, motivate and present their setting throughout Section 1.

In Section 2, the authors present a safe version of UCB for contextual linear bandits, where they aim at minimizing the $T$-period cumulative regret while ensuring that the linear constraint is met. The proposed algorithm (Safe-LUCB) is clearly described, with a well-organized presentation. Safe-LUCB constructs a confidence set at every bandit interaction, within which the unknown parameter $\mu$ lies with high probability; it then identifies the actions that are safe for all candidate parameters in that set, and plays the action within this safe decision set that minimizes the loss function (see the illustrative sketch below). Due to the uncertainty about the parameter $\mu$, the linear safety constraint restricts the choice of actions, which implies a tradeoff between learning the safe set and minimizing the regret.

This tradeoff is analyzed in Section 3, where regret bounds characterizing the effect of the safety constraints on regret are presented. The regret analysis under safety constraints is delicate; the regret is upper bounded in Theorems 2 and 3. For the case when the safety gap $\Delta$ is positive and known, the authors present a regret bound of order $O(\sqrt{T})$, while when the safety gap is zero, the provided bound is of order $O(T^{2/3}\log(T))$. The main takeaway is that exploration is mandatory, so that the safe set contains the true optimal action and good estimates of $\mu$ can be attained.

The authors argue that the effect of the safety requirements on regret can be reduced for large $T$. However, it is unclear how practical the proposed $T$s are (i.e., how long that horizon would be in real bandits). The exploration horizon $T'$ is of order $\log(T)$ and grows larger as the normalized safety gap $\Delta/\|B\|$ becomes smaller. These quantities depend on all the parameters of the problem, such as the minimum eigenvalue of the covariance of actions and $\delta$, and might be hard to compute in practice. Besides, Safe-LUCB as in Algorithm 1 requires knowledge not only of the safety constraints (which makes sense in practice), but also, to be able to compute $\beta_t$ in Eqn. (5), of the bounds on $\mu$, $x$, and the noise tails ($S$, $L$, and $R$, respectively). The authors agreed in their rebuttal that their proposed method requires all these constants as input, and I encourage them to explicitly indicate so in the presentation of Algorithm 1. In addition, the authors should include in the updated manuscript how to compute $\lambda_-$, as explained in their rebuttal. Comments on the practicality of these values (e.g., an illustrative example) would further improve the quality of the paper.

Finally, the authors study the most impactful setting, the one where there is no a-priori knowledge of the value of the safety gap. The authors argue that this general case is hard to address, since a bound on the unknown $\Delta$ might imply a $T'$ that is too large in practice: a lot of exploration might be necessary before the safe set is guaranteed to contain the true optimal action.
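For concreteness, a minimal sketch of one such safe optimistic round is given below. It assumes a finite set of candidate actions, a known constraint matrix $B$ in a constraint of the form $\mu^\top B x \le c$, and a fixed confidence radius `beta` standing in for $\beta_t$ in Eqn. (5); the variable names and the ellipsoid construction are illustrative assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def safe_optimistic_round(X_hist, y_hist, actions, B, c, beta, lam=1.0):
    """One illustrative round of a safe optimistic step for a linear bandit.

    X_hist: (t, d) past actions; y_hist: (t,) past losses; actions: (K, d)
    candidates; B, c: known constraint matrix and threshold in mu^T B x <= c;
    beta: confidence radius (plays the role of beta_t in Eqn. (5)).
    """
    d = actions.shape[1]

    # Regularized least-squares estimate of the unknown parameter mu.
    V = lam * np.eye(d) + X_hist.T @ X_hist
    mu_hat = np.linalg.solve(V, X_hist.T @ y_hist)
    V_inv = np.linalg.inv(V)

    # Ellipsoidal widths ||x||_{V^-1} for the loss and ||Bx||_{V^-1} for the constraint.
    loss_width = np.sqrt(np.einsum('kd,de,ke->k', actions, V_inv, actions))
    Bx = actions @ B.T
    cons_width = np.sqrt(np.einsum('kd,de,ke->k', Bx, V_inv, Bx))

    # Pessimistic safe set: keep x only if mu^T B x <= c for every mu in the
    # confidence ellipsoid. (Handling of an empty safe set, e.g., via a known
    # safe seed action, is omitted in this sketch.)
    safe = Bx @ mu_hat + beta * cons_width <= c

    # Optimistic (lower-confidence-bound) loss estimate; play the safe action
    # with the smallest optimistic loss.
    optimistic_loss = actions @ mu_hat - beta * loss_width
    optimistic_loss[~safe] = np.inf
    return int(np.argmin(optimistic_loss))
```

The pessimistic safe-set test is exactly what makes exploration mandatory: while the confidence widths are still large, few actions pass it, so the safe set may not yet contain the true optimal action.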
To address this, the authors present an alternative heuristic modification of Safe-LUCB, which empirically approaches the same regret (and has a worst-case bound that matches Safe-LUCB with $\Delta=0$). The key idea is to construct a lower confidence bound for the safety gap and calculate the length of the exploration phase associated with it; a rough skeleton of this two-phase scheme is sketched at the end of this review. However, this step is nontrivial. The authors show in the appendix how to compute such bounds for the finite $K$-armed linear bandit setting.

Both of their proposed safe algorithms require solving Equation (8), which is a non-convex optimization problem in general. The authors do clarify this in the appendix (and agree in their rebuttal to state so more clearly in the updated manuscript), and show how, for a relaxed problem with polytope decision sets, a simpler confidence region in Line 8 of Algorithm 1 can be computed. However, the modified approach (which is now computationally tractable) contains a $\sqrt{d}$ term, resulting in the regret of the modified algorithm scaling with the dimension $d$. Providing a discussion of the computational issues and cost of their algorithm in the main manuscript, with comments on the implications of a polytope decision set for the algorithm's regret, would be of interest.

Finally, the authors provide illustrative simulation results for their proposed algorithms. The results, although limited to only 20 realizations, validate their claims; the authors agree to update Figure 1 to include the variance of the performance (i.e., error bars). In addition, beyond a more elaborate comparison with prior work in the background section, it would be great if the authors could compare their algorithms to any of the alternatives described in Section 1.2, e.g., by posing the constraints of other works (such as budget constraints or minimum reward guarantees) in a linear constraint format.
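For illustration, the two-phase heuristic described above might be organized as follows. The exploration-horizon formula, the constant `C`, and the callables `sample_safe_action`, `optimistic_safe_step`, and `update_gap_lcb` are placeholders assumed by this sketch; the precise expressions (involving $\lambda_-$, $\delta$, $S$, $L$, $R$, and the dependence on $\Delta$) are given in the paper and its appendix.

```python
import numpy as np

def exploration_horizon(T, gap_lcb, C=1.0, eps=1e-6):
    """Illustrative exploration length: it grows like log(T) and blows up as
    the (lower bound on the) normalized safety gap shrinks. The exponent and
    the constant C shown here are indicative only, not the paper's formula."""
    return int(np.ceil(C * np.log(T) / max(gap_lcb, eps) ** 2))

def heuristic_safe_lucb(T, sample_safe_action, optimistic_safe_step, update_gap_lcb):
    """Skeleton of the heuristic two-phase scheme: explore within a known safe
    set while the current lower confidence bound on the safety gap still
    implies that more exploration is needed, then switch to the safe
    optimistic phase. All three callables are placeholders, not the authors'
    API."""
    history = []
    gap_lcb = 0.0  # with gap_lcb = 0 the horizon is huge, matching the Delta = 0 worst case
    for t in range(1, T + 1):
        if t <= exploration_horizon(T, gap_lcb):
            action = sample_safe_action()           # phase 1: pure safe exploration
        else:
            action = optimistic_safe_step(history)  # phase 2: safe optimistic play
        history.append(action)
        gap_lcb = update_gap_lcb(history)           # refine the gap lower bound
    return history
```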
Reviewer 2
- It is not clear how the work relates to some of the existing literature; please see the section below.
- The quality of the paper is good. It is presented in a clear way and it introduces a relevant analysis for the important case of unknown linear constraints. Thus, the theoretical contribution is useful.
- The experimental section should be improved; see below.
Reviewer 3
This paper studies the linear bandit problem with safety constraints. The authors propose a two-stage algorithm for the problem and provide theoretical guarantees on the sample complexity of the proposed algorithm. The performance of the proposed algorithms is evaluated in simulation experiments. The paper is well written and easy to understand. The reviewer quickly went through the proofs and has not found obvious flaws. The experimental results show that the proposed algorithms converge. More comparisons with existing methods would be appreciated. The two-phase algorithm is closely related to the algorithm proposed in "Stagewise Safe Bayesian Optimization with Gaussian Processes" (ICML'18). The reviewer suggests that the authors read it and decide whether it should be cited.