Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a probabilistic version of the "line search" procedure that is commonly used as a subroutine in many deterministic optimization algorithms. The new technique can be applied when the evaluations of the objective function and its gradients are corrupted by noise. Therefore, the proposed method can be successfully used in stochastic optimization problems, eliminating the requirement of having to specify a learning rate parameter in this type of problem. The proposed method uses a Gaussian process surrogate model for the objective and its gradients. This allows the authors to obtain a probabilistic version of the conditions commonly used to terminate line searches in the deterministic scenario. The result is a soft version of those conditions that is used to stop the probabilistic line search process. At each iteration of this process, the next evaluation location is selected using Bayesian optimization methods. A series of experiments with neural networks on the MNIST and CIFAR10 datasets validate the usefulness of the proposed technique.
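To make the "soft" termination idea concrete, here is a minimal sketch of the kind of procedure described above; it is not the authors' algorithm. It treats the noisy function and gradient values at each trial step as independent Gaussians with a known noise scale, whereas the paper uses a GP surrogate that couples the two conditions; the toy objective, helper names, and parameter values are illustrative only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def noisy_f_and_grad(t, sigma=0.05):
    # Hypothetical 1D objective f(t) = (t - 1)^2 with additive Gaussian
    # noise, standing in for a minibatch loss along the search direction.
    noise_f, noise_g = sigma * rng.standard_normal(2)
    return (t - 1.0) ** 2 + noise_f, 2.0 * (t - 1.0) + noise_g

def p_wolfe(f0, g0, ft, gt, t, c1=1e-4, c2=0.9, sigma=0.05):
    # "Soft" Wolfe conditions under a crude independence assumption; the
    # paper instead evaluates a coupled bivariate normal probability under
    # the GP posterior.
    mu_a = f0 + c1 * t * g0 - ft   # sufficient-decrease margin, want >= 0
    mu_b = gt - c2 * g0            # curvature margin, want >= 0
    s = np.sqrt(2.0) * sigma       # rough noise scale of each margin
    return norm.cdf(mu_a / s) * norm.cdf(mu_b / s)

f0, g0 = noisy_f_and_grad(0.0)
t, c_w = 0.3, 0.3                  # initial step and acceptance threshold
for _ in range(20):
    ft, gt = noisy_f_and_grad(t)
    if p_wolfe(f0, g0, ft, gt, t) >= c_w:
        break                      # probabilistic Wolfe termination
    t *= 2.0 if gt < 0 else 0.5    # naive extrapolate/backtrack step choice
print("accepted step size:", t)
```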
Quality:
The experiments illustrate the advantages of the proposed technique. However, it is not clear to me how strong the experiments are. The authors only considered 10 epochs for training the neural networks, which seems to be very low for the type of problems that the authors are considering. No one uses only 10 epochs to fit a neural network to MNIST or CIFAR10. I would have expected at least 100 epochs here, probably more. Also, the current state-of-the-art for training neural networks uses techniques such as rmsprop or adadelta to scale the noisy gradients so that they have a uniform magnitude across the different weights. The authors indicate that their method is a complement and not a competitor to these techniques, but it would have been great to see the results of applying the proposed technique in combination with rmsprop or adadelta:
Tieleman, Tijmen, and Geoffrey Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude." COURSERA: Neural Networks for Machine Learning 4 (2012).
Zeiler, Matthew D. "ADADELTA: An adaptive learning rate method." arXiv preprint arXiv:1212.5701 (2012).
The lack of these experiments prevents us from knowing if the proposed technique can actually be used to improve the current state-of-the-art in stochastic optimization.
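For reference, the rmsprop scaling referred to above keeps a running average of each weight's squared gradient and divides the gradient by its square root. A minimal sketch (the hyperparameter values are common defaults, not taken from the paper under review):

```python
import numpy as np

def rmsprop_update(w, grad, v, lr=1e-3, rho=0.9, eps=1e-8):
    # v tracks a running average of element-wise squared gradients;
    # dividing by sqrt(v) gives updates of roughly uniform magnitude
    # across weights, regardless of the raw gradient scale.
    v = rho * v + (1.0 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```

A probabilistic line search could, in principle, supply the global step length lr while rmsprop supplies the per-weight scaling, which is exactly the combination whose evaluation is missing.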
Another weak point that the authors should probably comment on is the additional cost implied by equations (17) and (18) to estimate the noise level in the noisy evaluations of the objective and its gradients. With neural networks, the evaluation of (17) requires computing the gradient individually for each data point in the minibatch. This prevents the use of the matrix operations that are commonly used to compute the gradient for the whole minibatch in practical implementations of neural network optimizers. Using those matrix operations produces significant speed-ups in the implementation of the computation of the gradient for the minibatch. This means that by not using them, the technique proposed by the authors may produce a significant reduction in optimization speed. To avoid this, the authors could design other ways to estimate the noise level, for example by keeping exponentially weighted moving averages in a similar way to the rmsprop technique.
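A minimal sketch of this suggested alternative, with illustrative function name and decay rate (not from the paper): track exponentially weighted moving averages of the minibatch gradient and of its element-wise square, whose difference gives a running per-element variance estimate.

```python
import numpy as np

def update_noise_estimate(grad, m1, m2, rho=0.95):
    # Running first and second moments of the minibatch gradient.
    # Their difference is a cheap running variance estimate that costs
    # only one extra element-wise square per minibatch, so the usual
    # batched matrix operations remain usable.
    m1 = rho * m1 + (1.0 - rho) * grad
    m2 = rho * m2 + (1.0 - rho) * grad ** 2
    var = np.maximum(m2 - m1 ** 2, 0.0)  # clamp small negative values
    return m1, m2, var
```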
Clarity:
The paper is very clearly written and organized. A few things could be improved for clarity, for example, the derivation of equation (15). In lines 312 and 313, there are probably some missing parentheses in the scaling and translation of y_i; something also seems to be wrong in the scaling and translation of y_i': the values y(0)=0 and y'(0)=-1 are not obtained from the formulas given by the authors. The authors should also clarify the paragraph between lines 300 and 309.
Originality:
The proposed method is new. This is the first application of probabilistic methods to extend line searches from the deterministic to the stochastic optimization scenario. The result is a technique that avoids having to specify learning rates in this type of optimization problems. However, there is a significant amount of work that attempts to provide automatic rules for selecting those learning rates:
Schaul, Tom, Sixin Zhang, and Yann Lecun. "No more pesky learning rates." Proceedings of the 30th International Conference on Machine Learning (ICML-13). 2013.
Ranganath, R., C. Wang, D. Blei, and E. Xing. "An adaptive learning rate for stochastic variational inference." International Conference on Machine Learning, 2013.
The authors should probably have cited and possibly compared with some of these methods.
Significance:
The approach followed by the authors is different from other techniques already proposed to automatically select learning rates (see the references above). The proposed method has the potential to improve the current state-of-the-art for solving stochastic optimization problems, and the experiments performed seem to suggest that. However, the lack of an empirical evaluation in combination with techniques such as adadelta or rmsprop prevents us from reaching that conclusion.
Q2: Please summarize your review in 1-2 sentences
The authors propose a technique for avoiding having to specify learning rates in stochastic optimization problems. The proposed technique is based on a probabilistic approach that combines very successful methods used in the deterministic optimization case (Wolfe conditions) with state-of-the-art techniques for efficient global optimization of noisy functions (Bayesian optimization implemented using the expected improvement heuristic and Gaussian processes), all under a very computationally efficient implementation. The proposed technique has the potential to improve the state-of-the-art in stochastic optimization. The main weakness of the paper is the experimental evaluation, which is not realistic. The authors should have considered more training epochs and combined their method with other techniques commonly used to solve stochastic optimization problems, such as adadelta or rmsprop.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
One of the challenges of stochastic optimization with mini-batch gradient descent on, e.g., deep neural networks, is the selection of an appropriate learning rate.
In deterministic optimization, this choice is often avoided by using line search.
This paper proposes a way to model the noise of stochastic objectives when performing line search and use Bayesian optimization to intelligently perform that search.
It shows that careful choices for the covariance function can lead to efficient searches and that standard Wolfe conditions can be framed probabilistically.
The paper provides recommendations for various meta-parameters in the algorithm.
Empirical evaluation is performed on two-layer neural networks and shows that the optimization is robust to initial learning rate choices.
I love the idea of this paper.
I've hoped someone would write this paper for a while, as it has become clear that Bayesian optimization should have something to say about traditional line search.
I commend the authors for their thoughtful consideration of a variety of practical issues, from covariance choice, to the subtleties of weak vs strong Wolfe conditions, as well as their willingness to make practical recommendations about meta-parameters.
My major complaint about the paper is that it does not follow through and examine whether this additional work provides a win over the increasingly large literature on adapting learning rates.
I really want this to be the "right" way to solve these kinds of problems, but I did not come away from the paper feeling like I should definitely use this.
That is, here are some questions that I wanted to see empirically answered, but were not:
- How sensitive is the method to variations in c_1, c_2, c_w, and \tau?
- How does this method compare to algorithms that have scale adaptation, such as those in [8], [9], [10], and [14]?
- The choice of mini-batch size (10) is unusually small, and yet this has a large impact on the noise. What is the impact of this choice? (See the note after this list.)
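A note on the last question above: for a size-m minibatch of i.i.d. per-example gradients with covariance \Lambda (notation assumed here, not quoted from the paper), the estimator variance is

```latex
\operatorname{Var}\!\left[\nabla\hat{\mathcal{L}}\right]
  = \operatorname{Var}\!\left[\frac{1}{m}\sum_{j=1}^{m}\nabla \ell_j\right]
  = \frac{\Lambda}{m},
```

so the noise standard deviation scales as m^{-1/2}, and batch size 10 is roughly three times noisier than the more common 100.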
Other Comments and Questions:
- The Wolfe conditions are a particular heuristic for balancing improvement in the line search against starting a new one, given limited information about the function. When exploring new probabilistic ideas, is it really necessary to take this heuristic so seriously as a gold standard? (The conditions are restated for reference after this list.)
- When the estimators y_t and y'_t are being computed, or when they are computed for different values of t in the line search, is the same subset of the data re-used? This was not clear to me, but the independence assumption stated in Eq. (3) makes me think that they are different subsets.
- What does it mean for an objective to be "irregular"?
- I spent some time drawing samples from the covariance function in Eq. (4) and did not find myself thinking these were interesting hypotheses for objectives. They tended to be straight lines with small perturbations. Also, I found that \tau had a large impact; one expects that for large values of \tau, the min functions would essentially just return \tau. Thus \tau seems to impact the effective scale of the problem. I wanted a more thoughtful discussion of this issue. (A sampling sketch is given after this list.)
- Given that the paper is motivated by issues with line search for stochastic objectives, a well-referenced and evidence-supported discussion of the "brittleness" discussed in the introduction would have been helpful. Moreover, it seems two things are ultimately conflated in the paper: 1) using Bayesian optimization for line search in stochastic _or_ deterministic problems, and 2) Bayesian optimization as a way to address the brittleness of line search in stochastic optimization.
- I don't really buy the need for analytic identification of the minima of the mean function, because the whole point of the Wolfe conditions is to avoid the need to find exact minima; various heuristics for finding the minima using, e.g., a grid, would be fine given the other costs involved. Also, it is somewhat worrying to make potentially mismatched modeling decisions simply to guarantee that it is easy to find the minimum between the evaluations.
- I like the story told by Figure 2.
- The explanation of c_w=0.3 was potentially interesting and should go into the supplementary material. I suspect the authors have a careful argument to make here that instead comes across as hand-wavy. Also, please evaluate the empirical sensitivity to this choice.
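For reference, the deterministic conditions questioned in the first item above are, in standard notation with 0 < c_1 < c_2 < 1 (as in, e.g., Nocedal & Wright):

```latex
f(t) \le f(0) + c_1\, t\, f'(0) \quad \text{(sufficient decrease)}, \qquad
f'(t) \ge c_2\, f'(0) \quad \text{(weak curvature)},
```

with the strong variant replacing the curvature condition by |f'(t)| \le c_2 |f'(0)|.

And here is a sketch of the sampling experiment described in the Eq. (4) item, under the assumption that Eq. (4) is the standard once-integrated Wiener process covariance with inputs shifted by \tau; the kernel form and all parameter values are assumptions of this sketch, not quotations from the paper:

```python
import numpy as np

def k_iwp(t, s, theta=1.0, tau=10.0):
    # Once-integrated Wiener process covariance with shifted inputs
    # t~ = t + tau: k(t, s) = theta^2 * (min^3/3 + |t - s| * min^2/2).
    m = np.minimum(t + tau, s + tau)
    return theta ** 2 * (m ** 3 / 3.0 + np.abs(t - s) * m ** 2 / 2.0)

rng = np.random.default_rng(1)
ts = np.linspace(0.0, 1.0, 200)
K = k_iwp(ts[:, None], ts[None, :])
L = np.linalg.cholesky(K + 1e-6 * K.max() * np.eye(len(ts)))
samples = L @ rng.standard_normal((len(ts), 5))
# With tau = 10 the prior marginal scale is already ~ tau^(3/2), so over
# t in [0, 1] the draws look close to straight lines, consistent with
# the observation in the item above.
print(samples.std(axis=0))
```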
Presentational Issues:
- P1 L32: "... if their gradient is corrupted ..." to "... if their gradients are corrupted ..."
- P1 L34: What is an "exchangeable" loss function?
- P1 L36-40: The notation where j=1..m and also "indices j are i.i.d. draws" doesn't really make sense and could perhaps be made clearer.
- P1 L40: "... the error ... is unbiased ..." should probably be more like "... the estimator \hat{\mathcal{L}} of \mathcal{L} is unbiased ..."
- P2 Figure 1: I found this figure somewhat confusing. I suggest drawing the "true" function using a light color to make the decisions clearer. It would be helpful if the value of c_1 were mentioned in the caption. I am not sure that I understand the representation of the Wolfe conditions; is the idea that if the tangent lines go through the orange regions then that point is not satisfactory?
- P3 L146-153: The notation y_i was previously used to denote the data in Section 1 and it is here being reused to indicate noisy function evaluations.
- P3 L133-153: The notation f is introduced as the objective, but the objective was previously introduced as \mathcal{L}, with \hat{\mathcal{L}} being the noisy estimate, now represented as y. I would like greater precision in the notation describing these functions.
- P3 Figure 2: The top figure is very busy for how small it is. I suggest removing the sample paths, i.e., the dashed lines.
- P4 L169-192: The somewhat unusual notation for partial derivatives would benefit from a one-sentence explanation.
- P5 L231: I am assuming that \eta is the minimum of the mean over actual previous evaluations and not candidates, but a sentence that explains this would be helpful.
- P5 L258: The probability of the Wolfe conditions being satisfied is a _truncated_ bivariate normal probability, I think. (See the sketch after this list.)
- P5 Figure 3: There is very little explanation of this figure in the caption, and it doesn't seem to be referred to in the text at all.
- P8 Figure 4: This figure is not well explained. What do the colors mean relative to the top and the bottom? Are the colors on the bottom meant to refer to the different vertical columns, i.e., initial learning rates, of the top?
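On the P5 L258 item: a sketch of how such a probability can be evaluated, with hypothetical posterior moments for the two Wolfe "margins" a_t (sufficient decrease) and b_t (curvature). The weak-condition case is an orthant probability; an upper limit on the curvature margin (the strong condition) would make it a rectangle, i.e., truncated, probability. All values below are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical posterior mean and covariance of the Wolfe margins
# (a_t, b_t) at some trial step t.
mu = np.array([0.4, 0.1])
cov = np.array([[0.20, 0.05],
                [0.05, 0.10]])

# P(a >= 0, b >= 0) = P(-a <= 0, -b <= 0): a bivariate normal orthant
# probability, evaluated via the CDF of the negated variable.
p_wolfe = multivariate_normal(mean=-mu, cov=cov).cdf(np.zeros(2))
print("P(Wolfe conditions hold):", p_wolfe)
```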
Q2: Please summarize your review in 1-2 sentences
This paper presents a very nice idea that I like very much.
The discussion is practical and thoughtful, but the empirical evaluation is weak and leaves many important questions unanswered.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper is concerned with line search procedures in stochastic gradient descent (SGD) algorithms. Line searches are indeed intrinsic and essential parts of classical non-linear optimization methods such as CG and BFGS. The topic is certainly important, and yet it seems that such procedures haven't been developed for SGD algorithms.
The paper starts with a brief introduction and literature review in Section 1, and then in Section 2 reviews some insights on the classical Wolfe conditions of deterministic line searches (as in, e.g., [20]). It could, though, be emphasized a bit more clearly that the point of the Wolfe conditions is to make it possible to *prove the convergence* of the algorithms.
The main contributions of the paper are in Section 3, where it is proposed that the GP regression with integrated Wiener process prior should be used to implement the local function approximation in the line search instead of the classical quadratic/cubic interpolation. As is already known from [24,25], this GP approach is equivalent to "just" using a spline approximation to the function---however, it comes with a clear probabilistic interpretation of the approximant.
The proposed line search method seems to be based on discretizing the interval into N cells and then doing local interpolation in them... where was this strategy originally analyzed? This is a somewhat strange way of implementing a line search, because a typically used line search, e.g. in BFGS implementations, is based on the bracketing/sectioning search described, e.g., on pages 33- of "Fletcher (1987). Practical Methods of Optimization". Some explanation of the relationship between the methods, or a reference, would be nice.
One potential problem I see here is that one of the assumptions used to derive the Wolfe conditions is that the function f is bounded from below. However, an integrated Wiener process (the *sample function*) doesn't seem to have a lower bound; does that break the theory? That is, are your assumptions on the (sample) function space really compatible with the assumptions used to derive the Wolfe conditions? I think this is not really a problem if you just have a sufficiently regular *optimization problem*, but you could think about this.
"Remark on convergence of SGD with line searches:" You can make anything converge in probability by using this trick, this is not a real convergence proof. If you really wish to analyze the convergence you really need to base on convergence proof on the Wolfe-conditions in the line search as is done, e.g., in the Fletcher's (1987) book cited above. But, I would not expect such a proof to be included in a NIPS article, but probably you could mention that this is not an "actual" proof of the convergence of the optimization algorithm.
Despite the limitations I think that the main idea of the approach is very intriguing.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an SGD line search that is implemented as a GP interpolation problem with probabilistic implementations of the Wolfe conditions. Although the idea is very intriguing, the downsides are that the practical algorithm seems a bit strange from a classical perspective, and it is not clear whether the theoretical assumptions underlying the Wolfe conditions are actually satisfied.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
Thanks to all reviewers for their positive feedback! We are very happy that all reviewers vote for acceptance.
Generally, the reviewers voiced 3 requests: more analysis, more experiments and more citations. Many of the suggestions are really great, and we will gratefully consider them for a long-form version. For this 8-page NIPS submission, we respectfully ask the reviewers to consider that the paper already cites 7 related methods (including some of those proposed by the reviewers!) and already provides 3 supplementary pages of additional results. It is the nature of a conference paper that not every possible point can be addressed.
Detailed replies:
-- Reviewer 1 --
- SGD is still the most widely used optimization method for neural networks. It is debatable how popular AdaDelta and rmsprop are, also compared to AdaGrad, momentum, etc. 10 epochs is not an unusually short training time for MNIST; it is evidently enough to achieve reasonable test errors!
- It is absolutely possible to implement the variance estimation in matrix operations, and on GPUs. The additional operation is an element-wise square of the gradient elements. As mentioned in the paper, this causes about a 1% cost increase per minibatch. This estimator is not our invention; it was also used by Schaul et al., as the paper mentions (ref [14]).
-- Reviewer 2 --
- As footnote 1 points out, the Wolfe conditions are _also_ useful to guarantee convergence (e.g. for BFGS). But this is not their only use (see, e.g., Nocedal & Wright, Chpt. 3.1). For gradient descent, the main practical point is to find good step sizes.
- Interpolation as an alternative to bracketing is discussed, e.g., in Nocedal & Wright, page 57ff. Our implementation is based on the version in minimize.m, hence this is what we cite.
- While the _argument_ for the Wolfe conditions might assume a bound, their _definition_ does not involve boundedness. (Besides, it is of course possible to bound the excursion range of Wiener processes up to arbitrarily small probabilities.)
-- Reviewer 4 --
Thank you very much for a very thorough review!
- You are suggesting great additional experiments, which we will gladly consider for a journal version; but see our note above on space constraints. Our choices for c1 and c2 are similar to those mentioned in works like Nocedal & Wright, and were not tuned. In our experiments, changing the threshold c_W across wide ranges had only a minor effect, because in _most_ line searches, p_W tends to change relatively clearly after an evaluation. So ranging c_W from about 0.1 to 0.9 has little effect. Under the extreme choices of c_W close to 0 and 1, the line search accepts almost every point, or almost none, respectively.
- Batches were indeed switched after each function evaluation, as is standard wherever SGD is used. Thus the independence assumption is correct. Batch size 10 is not unusually small; some people even use unit batches. Since big batches are easier for the line search (if the batch is the entire dataset, it reverts to the classic noise-free case), using small batches gives a good lower bound for the performance.
- It's amazing that you did experiments with the integrated Wiener process! Crucial here is that each line search starts with an initial evaluation of (f, f'). Because the integrated Wiener process is a Markov process whose state space consists of precisely (f, f'), this first evaluation "shields" against tau: if the first evaluation were made without noise, it would form a Markov blanket, and tau would have no effect. If the first evaluation is made with noise, then for _large_ values of tau (because the marginal variance grows with tau^3, even tau=10 is "large"), the posterior over (f, f') at t=0 approaches the Gaussian observation likelihood of the first evaluation (for tau -> 0, the observation will be ignored). The integrated Wiener process is the unique Gauss-Markov process that only assumes the existence of (f, f'). In this sense it is the "optimal" (minimally assertive) choice. To get more interesting samples, try conditioning on an observation (f(0), f'(0)) and scaling the input domain (like we do).
- We use the analytic posterior mean minima to produce _candidate_ points for evaluation. The Wolfe conditions are _termination_ conditions; they are not used to guide the search itself. Thus, model mismatch here is not all that crucial: a bad model will just lead to an inefficient search strategy, just like in a classic line search. The Wolfe conditions are only inferred at points with available observations. There, the model (prior) has only a weak effect. Using a regular grid to build a candidate list would waste computation on uninteresting locations. The cool thing about using local minima of the posterior mean is that it is *very* fast (see the sketch below).
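To illustrate why these candidate minima are cheap (a hypothetical sketch, not code from the implementation): assuming the posterior mean is piecewise cubic between evaluations, as it is for an integrated Wiener process prior, each cell's interior minimizer is a root of a quadratic. Function name and coefficients below are illustrative only.

```python
import numpy as np

def cubic_cell_minimum(a, b, c, d, lo, hi):
    # Posterior mean on one cell modeled as m(t) = a + b*t + c*t^2 + d*t^3.
    # Interior minima solve m'(t) = b + 2c*t + 3d*t^2 = 0 with m''(t) > 0.
    if d == 0.0:                       # quadratic cell: one stationary point
        if c > 0.0:
            t = -b / (2.0 * c)
            return t if lo < t < hi else None
        return None
    disc = 4.0 * c ** 2 - 12.0 * d * b
    if disc < 0.0:                     # no real stationary points
        return None
    for sign in (1.0, -1.0):
        t = (-2.0 * c + sign * np.sqrt(disc)) / (6.0 * d)
        if lo < t < hi and 2.0 * c + 6.0 * d * t > 0.0:  # m''(t) > 0
            return t
    return None

print(cubic_cell_minimum(0.0, -1.0, 0.0, 1.0, 0.0, 2.0))  # ~0.577 for t^3 - t
```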
All minor issues will be addressed, thanks for pointing them out!
-- Reviewers 5, 6, and especially 7 --
Thank you very much for your positive reviews!