Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Update after reading the authors' rebuttal:
One of my main concerns was the overhead due to GP inference. This has
been addressed by the authors reasonably well. The authors acknowledge the
other limitations brought up in my review. I have updated my quality
score, although I think the technique needs to be studied more thoroughly.
1. Summary
This paper presents a multi-task
Bayesian optimization approach to hyper-parameter setting in machine
learning models. In particular, it leverages previous work on multi-task
GP learning with decomposable covariance functions and Bayesian
optimization of expensive cost functions. Previous work has shown that
decomposable covariance functions can be useful in multi-task regression
problems (e.g. [13]) and that Bayesian optimization based on
response-surfaces can also be useful for hyper-parameter tuning of machine
learning algorithms [6, 7].
The paper combines the decomposable covariance assumption [13] with Bayesian optimization based on expected improvement [17] and entropy search [7] to show empirically that it is possible to: (a) transfer optimization knowledge across related problems, addressing e.g. the cold-start problem; (b) optimize an aggregate of different objective functions, with applications to speeding up cross-validation; and (c) use information from a smaller problem to help optimize a bigger problem faster.
Positive
experimental results are shown on synthetic data (Branin-Hoo function),
optimizing logistic regression hyper-parameters and optimizing
hyper-parameters of online LDA on real data.
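As a point of reference for this construction, the following is a minimal sketch (hypothetical code, not the authors' implementation) of a decomposable multi-task covariance in the style of [13] and the GP posterior it induces; the squared-exponential input kernel, the free-form task covariance B, and the noise level are illustrative assumptions.

    # Hypothetical sketch (not the authors' code): a decomposable multi-task
    # covariance k((x,t), (x',t')) = B[t,t'] * k_x(x, x') in the style of [13],
    # and the GP posterior it induces when observations are pooled over tasks.
    import numpy as np

    def k_x(X1, X2, lengthscale=1.0):
        # Squared-exponential input kernel (an illustrative choice).
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    def k_multi(X1, t1, X2, t2, B, lengthscale=1.0):
        # B is a positive-definite task covariance (num_tasks x num_tasks);
        # t1, t2 are integer task indices, one per row of X1, X2.
        return B[np.ix_(t1, t2)] * k_x(X1, X2, lengthscale)

    def posterior(X, t, y, Xs, ts, B, noise=1e-3):
        # Posterior mean and variance at test inputs Xs on tasks ts, given
        # observations (X, t, y) from all tasks.
        K = k_multi(X, t, X, t, B) + noise * np.eye(len(y))
        Ks = k_multi(Xs, ts, X, t, B)
        Kss = k_multi(Xs, ts, Xs, ts, B)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        v = np.linalg.solve(L, Ks.T)
        return Ks @ alpha, np.diag(Kss) - (v ** 2).sum(axis=0)

Observations from an auxiliary task influence the posterior on the primary task only through the off-diagonal entries of B, which is what makes the kind of transfer in (a)-(c) possible.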
2. Quality
The paper is technically sound in that it combines existing
techniques that have been shown to be effective on multi-task regression
problems and on single-task optimization of ML algorithms. The paper does not make theoretical contributions to the field; instead, it uses heuristics for the tasks at hand, for example when defining a procedure for optimizing an average function over multiple tasks and when defining an acquisition function based on entropy search for the multi-task problem.
The claims are supported empirically on various synthetic
and real problems. However, the following issues are of importance:
(a) The method is not consistently evaluated in terms of time; for most experiments only the number of function evaluations is
considered. It is not explicitly shown how expensive these function
evaluations are on the problems at hand and how they compare to the
overhead caused by using the multi-task GP response surface methodology.
This is of particular importance as sampling is done to marginalize over
the kernel parameters and to compute Equation (7). In other words, total
time that includes the overhead of the method should be shown across all
experiments.
(b) It is unclear how general the method really is.
The maximum number of function evaluations ranges from 30 up to around
200. This looks very small for general machine learning problems and as
one increases the number of function evaluations the overhead of the
method may be too high to be applicable.
(c) There is not a lot of analysis of why the method works. In particular, in the small-to-large experiment it may look counterintuitive that "smoothing" parameters obtained on small data can help to find good "smoothing" parameters on bigger data.
(d) It has been shown that the fixed correlation
assumption in multi-task regression problems may be too strong in
real-world problems (see e.g. Wilson et al (2012)). This problem may be
even more drastic in optimization settings where the objective functions
may be correlated differently in distinct regions of the parameter space.
(e) Unlike optimization of black box functions, in most
machine learning settings, there is knowledge of the function to be
optimized. This does not seem to be exploited in the proposed approach.
(f) The problem of learning "spurious" correlations, especially when only a small number of function evaluations is available, is not addressed in the paper.
(g) In the example provided in Figure 3, it appears that having different tasks reduces the predicted variance, which has the positive effect of helping to find the right point in a small number of steps. However, response-surface methods are known to have problems with
underestimating the variance [17]. This means that some regions of the
space will be massively under-explored. The paper does not address this
issue.
3. Clarity
The paper is well written and well
organized and provides adequate background to the reader familiar with
GPs. However, the following items need consideration:
(a) It does not seem that the Kronecker structure holds for all experiments. It seems that Equation (3) should refer to a decomposable covariance function rather than a Kronecker product (a short note on this distinction follows after point (b)).
(b) There is not sufficient detail in the experiments to make them reproducible. For example, to generate the results in Fig. 3, it is unclear what parameters are optimized in the LR
case.
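To make the distinction in point (a) concrete: for observations at arbitrary (input, task) pairs, the joint covariance has entries

K((x, t), (x', t')) = K_t(t, t') * K_x(x, x'),

i.e. a decomposable (product) covariance. The literal Kronecker form K = K_t \otimes K_x only arises in the special case where every task has been evaluated at the same set of inputs (a complete grid design).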
4. Originality
The paper is a combination of
existing ideas. There are no novel theoretical contributions, but the
claims are supported empirically.
5. Significance
The proposed approach is unlikely to have a big impact on practitioners or researchers. There are not sufficient theoretical grounds for why or when their method can be useful. The experimental settings may be quite restrictive; for example, when one requires a large number of function evaluations, the overhead of the technique may be too high.
6. Additional comments
(a) line 139: "due to it's simple form" -->
"due to its simple form"
(b) line 313: Should it be Figure 3 (e),
(f) instead of Figure 5?
References
Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In ICML.
Q2: Please summarize your review in 1-2 sentences
This paper presents an approach to optimization of hyper-parameters in machine learning methods that combines existing ideas on multi-task learning and Bayesian optimization. Overall, the paper does not make new theoretical contributions, but it supports its claims experimentally. However, the experiments fail to analyze how general the method really is, the potential overhead of the technique, and when it is likely to work or not.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper addresses the problem of automatically
selecting hyperparameters for machine learning architectures. The authors
consider Bayesian optimization (BO) with Gaussian processes (GP) and adapt
this framework to a multi-task setup. This setup allows the sharing of
information across multiple optimization problems, and can thus, for
instance, exploit knowledge acquired while optimizing the hyperparameters
of a learning architecture for one dataset during the optimization of
these parameters for a second dataset. They explore three variants of
their scheme: (1) transferring knowledge from previously completed
optimizations; (2) speeding up cross-validation by cleverly selecting
folds instead of evaluating all parameter settings on all folds; and (3) a
cost-sensitive version of hyperparameter optimization that can take
advantage of cheaper surrogate tasks (e.g. optimizing the parameters on
subsets of the data) for solving a computationally more expensive primary
problem by actively selecting the task that is likely to yield the largest
improvement per unit cost (e.g. compute time).
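The "largest improvement per unit cost" selection in scenario (3) can be illustrated with a short hypothetical sketch; score_fn and cost_fn are placeholders (e.g. expected information gain about the primary task's minimum and a model of evaluation time), not functions from the paper.

    # Hypothetical sketch: pick the (input, task) pair with the largest
    # acquisition value per unit cost. score_fn(x, t) and cost_fn(x, t) are
    # placeholders supplied by the caller.
    def select_next(candidates, tasks, score_fn, cost_fn):
        best, best_rate = None, float("-inf")
        for x in candidates:
            for t in tasks:
                rate = score_fn(x, t) / cost_fn(x, t)
                if rate > best_rate:
                    best, best_rate = (x, t), rate
        return best

In the two-task setting described here, tasks would be the cheap surrogate (e.g. a subset of the data) and the expensive primary problem.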
Optimizing
the hyper-parameters of machine learning architectures and algorithms is
an important practical problem. The paper falls in line with several
recent works that apply sequential model-based optimization (SMBO)
techniques, including BO with GPs, to this end (see e.g. refs in the
paper). The idea of applying multi-task learning to BO for
hyper-parameter optimization is not completely novel (see the very recent
paper by Bardenet et al.; ref [18]), but appears promising and has, to my
knowledge, not been explored widely.
The paper is generally well
written, providing a good explanation of GP BO, and of the proposed
extensions. There may be a few places where clarity could be improved (see
below).
The contribution of the paper is primarily that it brings
together, in a sensible manner, various existing tools from the machine
learning literature to adapt BO to a multi-task setup (I don't think the
"ingredients" themselves are really novel). Furthermore, it proposes --
and empirically explores -- three scenarios for knowledge sharing across
tasks in the context of automatic hyper-parameter optimization that are
likely to be useful in practice.
Specifically, it seems to be
building on the work of Snoek et al. (ref [6]), integrating the multi-task
kernel of Bonilla et al. (ref [13]), and the recently proposed
entropy-search (ES) acquisition function of Hennig & Schuler ([7]). It
further finds a new use for the idea of cost-sensitive BO that was
previously proposed in [6]. These "architectural" choices seem generally
sensible (see below for some additional comments). Of the three application scenarios considered, I find scenario (3) especially interesting, and it appears to be a good potential use case for
cost-sensitive BO.
I find the experimental evaluation generally convincing: for each scenario, the authors provide at least one real-world
example demonstrating that the respective multi-task approach leads to
equally good or better settings of the hyper-parameters in a shorter
amount of time than performing BO without taking into account the
information from the related task.
In some cases additional control experiments could have been performed to provide further baselines, test the limits of the proposed approach, and justify and tease apart the effects of the various architectural choices (see below
for details).
Overall, I think this is a well written paper that
is likely to be of practical value for the ML community.
Some additional comments:
- entropy
search criterion (eq. 6): I might well be mistaken but I am a bit confused
by the notation, specifically the conditioning of p(f | x, \theta, {x_n,
y_n}) on x. If I'm understanding correctly, p(f|...) represents the
K-dimensional Gaussian distribution arising from evaluating the posterior
over the Gaussian process at points in \tilde{X}. So why do you condition
on x, but not on the remaining points of \tilde{X}? Couldn't you just drop
x here since you are indexing f in the product term?
Also, this
seems to be slightly inconsistent with eq. (7) where p(f|x) indicates that
the GP is evaluated at a candidate location for an observation (which, if
I'm understanding correctly, doesn't necessarily have to coincide with any
of the points in \tilde{X})?
- I think [7] express your equation
(7) in terms of the KL between P^y_{min} and a uniform distribution.
They comment that this is a better choice than the KL you are considering
(their page 14). Maybe it'd be worth commenting on this?
-
two-step heuristic in CV experiments (lines 204ff): Why is this two-step approach necessary? Given that the number of folds is probably not huge, couldn't you optimize the average objective with respect to an x for each
task, and then choose the best x from that set? (Although maybe not much
will be gained by this.)
- EI vs entropy search (ES; line 240f):
You say that EI does not generalize directly to the multi-task setup while
ES does. Maybe you could explain why? Couldn't you use the EI on the
primary task (per unit cost) as acquisition function? After all, you seem
to be using EI on the primary task anyway for the generation of candidate
points at which to evaluate the KL-objective (line 261f).
- eq. 9:
Is there a particular reason why you express (9) in terms of a difference
in entropies and not in terms of KL as in (7)?
- Experiments in
sections 4.1 / 4.2: Am I right to assume that in these experiments you
always use the EI criterion as acquisition function? Would you expect ES
to perform any differently in these experiments? What about the comparison
between STBO and MTBO in section 4.3: how much of the difference is
attributable to multi-task vs. single-task and how much to the different
acquisition functions?
- I assume that for the experiments in
4.1/4.2 single-task and multi-task function evaluations take roughly the
same amount of time so that there is no need to plot performance as a
function of compute time?
- Experiments in section 4.1: How do the presented warm-start results compare to very simple baseline strategies for transferring knowledge from previously completed optimizations, such as
> just using the best-performing hyperparameter setting found in the previous optimization
> fixing some of the parameters to good values found in the previous optimization and searching only over the remaining ones
> searching only over a smaller range for all or some of the parameters based on the results of the previous optimization
- Experiments in section 4.1: Is there anything to be gained by
considering the results of more than a single completed previous
optimization? (E.g. for CNN on SVHN: combining information from CIFAR as
well as from the subset of SVHN?) Also, how does the usefulness of
information sharing depend on the number of iterations performed in the
previous optimization?
- Have you considered using alternative
multi-task kernels, e.g. one in which task similarity is not modeled
"free-form" but depends on certain task features?
- Section 4.3:
Your experiments seem to be considering a rather simple case of
cost-sensitive BO in that the cost exclusively depends on the task s.t.
c_t(x) = C_t (or does the cost actually depend on x?), and there are only
two tasks. Have you considered other, more challenging scenarios, e.g. the possibility of varying the number of datapoints in the subsets at a finer
granularity?
- y-axis labels in Fig. 3e,f seem wrong
- in eq. (8) shouldn't there be a task label for each past observation, e.g. {x_n, y_n, t_n}?
- additional reference: Bergstra et al., ICML
2013: "Making a Science of Model Search: Hyperparameter Optimization in
Hundreds of Dimensions for Vision Architectures"
Q2: Please summarize your review in 1-2
sentences
I think this is a well written paper that is likely to
be of practical value for the ML community.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Overview
This paper extends GP-based Bayesian optimization in three ways: 1) a multi-task GP to address cold-start problems, 2) improving the efficiency of cross-validation, and 3) saving optimization cost by using information from a cheaper function. These extensions are well supported by corresponding experimental results.
Quality
The proposed ideas are novel, the techniques are sound, and the results are informative and convincing. Here are a few questions.
1. The statement of marginalizing out GP parameters by elliptical sampling may be a little misleading. The reason is that, conditioning on different numbers of samples, the posterior distributions of the GP parameters change, and running the elliptical sampling for a longer time to convergence for each new sample or function evaluation would be too expensive. So probably the sampler is used with only a few iterations (as a search method) or fixed after a while? Could the authors give more details on this?
2. A related, natural question is why not simply use Bayesian optimization to tune the GP parameters based on another GP, which is hopefully less sensitive to its parameter values compared to the first GP used for the original Bayesian optimization.
3. Given the cubic cost of GP inference, how about using a sparse GP in case we have a complicated function for which many more function evaluations are needed? This could further reduce the cost.
Clarity
The paper is well written and I enjoyed reading it. In some places (e.g., the discussion of the need for improving cross-validation) I felt the paper was a little repetitive and wordy.
Minor typos:
Line 308: Figure 5 -> Figure 3. Also explain LR in the figure subtitle.
References: Lines 440, 459: bayesian -> Bayesian. Line 455: gaussian -> Gaussian.
Originality
Although these are extensions of recent work, the proposed approaches are quite novel.
Significance
I expect this paper to have a high impact because tuning hyper-parameters is a common and important problem for many machine learning algorithms.
Q2: Please summarize your review in 1-2
sentences
The paper uses a matrix-variate GP in Bayesian optimization to address cold-start problems, improve the efficiency of cross-validation, and save optimization cost by using information from a cheaper function. The proposed ideas are novel, the techniques are sound, and the results are informative.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their valuable feedback.
General comments: It is important to clarify that, as it is defined, EI cannot be used directly for MTBO (sec. 4.3) because the
improvement on the primary task from an auxiliary observation is always 0.
To solve this issue, we introduced cost-sensitive ES. We feel that this is
one of the main novel contributions of the paper.
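In the usual notation, EI at a candidate x on the primary task is

EI(x) = E[ max(y_best - y, 0) ],

where y_best is the incumbent value on the primary task; an observation on an auxiliary task cannot change the primary incumbent, so its expected improvement is identically zero. A cost-sensitive entropy-search acquisition, roughly of the form

a(x, t) = ( H[P_min] - E_y[ H[P_min | x, t, y] ] ) / c_t(x),

can still assign positive value to auxiliary-task evaluations, because they shift P_min, the distribution over the location of the primary task's minimum (this is a schematic form, not necessarily the paper's exact expression).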
We used the
open-source code from [6] as the basis for our experiments. Like them, we
only draw a few samples per iteration, warm-starting from the previous
value. In general, each function evaluation takes on the order of hours
while selecting the next point takes seconds to minutes, so the GP
overhead is negligible. Fig. 5 includes timing results, but we will
provide more. For comparison, MTBO takes around 6 processor days while the
grid search in [28] takes 60-120.
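A schematic of the warm-started sampling loop described above (the function and its arguments are placeholders, not the code from [6]):

    # Schematic only: at each Bayesian optimization iteration, re-run a short
    # MCMC chain over the GP hyperparameters, warm-started from the previous
    # iteration's samples, instead of running a fresh chain to convergence.
    def update_hyper_samples(prev_samples, log_posterior, mcmc_step, n_steps=10):
        # prev_samples: hyperparameter settings kept from the last iteration
        # log_posterior: unnormalized log posterior under current observations
        # mcmc_step: one transition (e.g. a slice-sampling update) targeting it
        new_samples = []
        for theta in prev_samples:
            for _ in range(n_steps):
                theta = mcmc_step(theta, log_posterior)
            new_samples.append(theta)
        return new_samples

Acquisition values can then be averaged over the returned samples, keeping the per-iteration overhead small relative to an hours-long function evaluation.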
One reason why our method works
is that problems from similar domains tend to have similar error surfaces;
humans tend to make the same assumption during manual tuning. Single-task
optimization generally needs to do a lot of initial exploration. In the
multi-task setting, these regions are not re-explored since bad regions
tend to be bad across many different tasks.
Theoretical analysis
of Bayesian optimization remains an active topic of research. However,
despite the lack of theoretical proofs this approach has been shown to
consistently beat popular heuristics like grid search, random search and
manual tuning. We feel that previous work in this area, along with our
empirical results on multiple challenging machine learning problems
justifies the applicability of this approach.
Regarding community
interest: As machine learning models have become more sophisticated and
have more knobs to turn, there has been increasing interest in tools for
hyperparameter tuning. In the last several years, such tools - and
Bayesian optimization in particular - have set records on a variety of
difficult machine learning benchmarks. There have been recent ICML and
NIPS workshops on related topics and at least one popular Coursera ML
course includes Bayesian optimization as a topic.
Comments to
specific reviewers:
Reviewer 1: Quality: a) Fig. 5a and 5d
show timing comparisons. The time spent on GP inference is negligible
compared to the cost of each function evaluation. See the general comments
for more details.
b) While there may be cases where more than 200
iterations are needed, we show on several challenging problems that human
performance can be matched or exceeded in far fewer iterations.
c)
In general, we would of course not expect them to be identical. The point
is that we can leverage even weak information to improve performance. This
is somewhat similar to parallel tempering MCMC methods, for example, in
which smoother versions of the problem inform the difficult primary
problem.
d) We agree that this is a limitation, but feel that this
is a topic for future work.
e) We utilize prior knowledge when
setting the boundaries of the search domain. We also used prior knowledge
that e.g., all tasks use convolutional networks for object recognition
(sec. 4.1), or that we are performing cross-validation (4.2). In a sense,
the paper is entirely about including such prior knowledge, because the
user specifies which tasks are related.
f) We marginalize out the
correlations with MCMC, so we hope to be robust to this when there are few
evaluations; this appears to work well.
g) It is true that this is
a larger issue for Bayesian optimization. However, we feel that we avoid
many of these fragility issues, and observe this empirically, by using a
fully Bayesian treatment that marginalizes over all GP hyperparameters.
Clarity: a) The Kronecker structure is used in all multi-task
experiments. Kx is itself decomposable.
b) The LR experiments,
among others, were reproduced from [6]. We will include these details.
Reviewer 2: Notation - Thank you, we will certainly make this
more clear.
Base distribution - We compared to using uniform as a
base distribution for ES. Unlike [7] we found that the previous Pmin
worked better for our problems.
Average EI - On the average
objective, all tasks share the same x, so there is no way to optimize this
wrt x for each task individually. We did try optimizing EI for each task
individually, but our current approach worked much better.
KL -
Both KL and information gain are equivalent; however, we felt that
cost-sensitive IG is more intuitive than cost-sensitive KL. We will
clarify this.
ES in experiments - As an acquisition function, we
found that ES and EI perform similarly, so we used EI when we could due to its analytic form. In sec. 4.3 we found no way to use EI (see general
comments).
Timing - Yes, the cost of the multi-task version comes
from having more observations; the additional cost of this is negligible
compared to the function evaluations themselves.
Baselines - We
found that the best settings from one task were generally not the best
settings on other tasks and could be greatly improved. We will include
these results as baselines. The other suggestions are interesting, but
also require more domain knowledge.
4.1 - Using more tasks would
likely help, as would more observations on previous tasks. The cost would
be more expensive inference.
Kt - We are currently looking into
using features, e.g. dataset size, for Kt.
4.3 - We only used one
cost per task, but [6] considered the general case and we expect that it
will work in our setting too.
Citation - Thank you, we will add
this reference.
Reviewer 3: Quality: 1. See general
comments above.
2. This would be very interesting if it worked,
but perhaps beyond the scope of this paper.
3. We are currently
exploring sparse variants for computational efficiency.
Clarity:
We will try to make the writing more concise.