Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Olivier Chapelle, Zaïd Harchaoui*

Choice-based conjoint analysis builds models of consumer preferences over products with answers gathered in questionnaires. Our main goal is to bring tools from the machine learning community to solve this prob- lem more efficiently. Thus, we propose two algorithms to quickly and accurately estimate consumer preferences.

1 Introduction

Conjoint analysis (also called trade-off analysis) is one of the most popular marketing re- search technique used to determine which features a new product should have, by conjointly measuring consumers trade-offs between discretized1 attributes. In this paper, we will fo- cus on the choice-based conjoint analysis (CBC) framework [11] since it is both widely used and realistic: at each question in the survey, the consumer is asked to choose one product from several.

The preferences of a consumer are modeled via a utility function representing how much a consumer likes a given product. The utility u(x) of a product x is assumed to be the sum of the partial utilities (or partworths) for each attribute, i.e. linear: u(x) = w x. However, instead of observing pairs (xl, yl), the training samples are of the form ({x1, . . . , xp}, y k k k ) indicating that among the p products {x1 , . . . , xp}, the yth was preferred. Without noise, k k k this is expressed mathematically by u(xyk ) u(xb ), b = y k k k .

Let us settle down the general framework of a regular conjoint analysis survey. We have a population of n consumers available for the survey. The survey consists of a questionnaire of q questions for each consumer, each asking to choose one product from a basket of p. Each product profile is described through a attributes with l1, ..., la levels each, via a vector of length m = a l s=1 s, with 1 at positions of levels taken by each attribute and 0 elsewhere.

Marketing researchers are interested in estimating individual partworths in order to per- form for instance a segmentation of the population afterwards. But traditional conjoint estimation techniques are not reliable for this task since the number of parameters m to be estimated is usually larger than the number of answers q available for each consumer. They estimate instead the partworths on the whole population (aggregated partworths). Here we

```
1e.g. if the discretized attribute is weight, the levels would be light/heavy.
```

aim to investigate this issue, for which machine learning can provide efficient tools. We also address adaptive questionnaire design with active learning heuristics.

2 Hierarchical Bayes Analysis

The main idea of HB2 is to estimate the individual utility functions under the constraint that their variance should not be too small. By doing so, the estimation problem is not ill-posed and the lack of information for a consumer can be completed by the other ones.

2.1 Probabilistic model

In this section, we follow [11] for the description of the HB model and its implementation. This method aims at estimating the individual linear utility functions ui(x) = wi x, for 1 i n. The probabilistic model is the following:

```
1. The individual partworths wi are drawn from a Gaussian distribution with mean (representing the aggregated partworths) and covariance (encoding population's heterogeneity),
2. The covariance matrix has an invert Wishart prior, and has an (improper) flat prior.
3. Given a set of products (x1, . . . xp), the probability that the consumer i chooses the product x is given by
exp(wi x) P (x|wi) = p . (1) exp(w b=1 i xb)
```

2.2 Model estimation

We describe now the standard way of estimating , w (w1, . . . , wn) and based on Gibbs sampling and then propose a much faster algorithm that approximates the maximum a posteriori (MAP) solution.

Gibbs sampling As far as we know, all implementations of HB rely on a variant of the Gibbs sampling [11]. During one iteration, each of the three sets of variables (, w and ) is drawn in turn from its posterior distribution the two others being fixed. Sampling for and is straightforward, whereas sampling from P (w|, , Y ) P (Y |w). P (w|, ) is achieved with the Metropolis-Hastings algorithm.

When convergence is reached, the sampling goes on and finally outputs the empirical ex- pectation of , w and . Although the results of this sampling-based implementation of HB3 are impressive, practitioners complain about its computational burden.

Approximate MAP solution So far HB implementations make predictions by evaluating (1) at the empirical mean of the samples, in contrast with the standard bayesian approach, which would average the rhs of (1) over the different samples, given samples w from the posterior. In order to alleviate the computational issues associated with Gibbs sampling, we suggest to consider the maximum of the posterior distribution (maximum a posteriori, MAP) rather than its mean.

```
2Technical papers of Sawtooth software [11], the world leading company for conjoint analysis softwares, provide very useful and extensive references. 3that we will call HB-Sampled or HB-S in the rest of the paper.
```

To find , w and which maximize P (, w, |Y ), let us use Bayes' rule,

```
P (, w, |Y ) P (Y |, w, ) P (w|, ) P (|) P ()
P (Y |w) P (w|, ) P () (2)
```

Maximizing (2) with respect to yields MAP = I+Cw , with C n+d w being the "covariance" matrix of the wi centered at : Cw = (wi - )(wi - ) . Putting back this value in (2), we get

```
- log P (, w, MAP|Y ) = - log P (Y |w) + log |I + Cw()| + C, (3)
```

where C is an irrelevant constant. Using the model (1), the first term in the rhs of (3) is convex in w, but not the second term. For this reason, we propose to change log |I + Cw| by trace(Cw) = ||wi - ||2 (this would be a valid approximation if trace(Cw) 1). With this new prior on w, the rhs of (3) becomes

```
n
W (, w) = - log P (Yi|wi) + ||wi - ||2. (4) i=1
```

As in equation (3), this objective function is minimized with respect to when is equal to the empirical mean of the wi. We thus suggest the following iterative scheme to minimize the convex functional (4):

```
1. For a given , minimize (4) with respect to each of the wi independently.
2. For a given w, set to the empirical mean4 of the w.
```

Thanks to the convexity, this optimization problem can be solved very efficiently. A New- ton approach in step 1, as well as in step 2 to speed-up the global convergence to a fixed point , has been implemented. Only couple of steps in both cases are necessary to reach convergence.

Remark The approximation from equation (3) to (4) might be too crude. After all it boils down to setting to the identity matrix. One might instead consider as an hyperparam- eter and optimize it by maximizing the marginalized likelihood [14].

3 Conjoint Analysis with Support Vector Machines

Similarly to what has recently been proposed in [3], we are now investigating the use of Support Vector Machines (SVM) [1, 12] to solve the conjoint estimation problem.

3.1 Soft margin formulation of conjoint estimation

Let us recall the learning problem. At the k-th question, the consumer chooses the yth k product from the basket {x1 , . . . , xp}: w xyk w xb , b = y k k k k k . Our goal is to estimate the individual partworths w, with the individual utility function now being u(x) = w x. With a reordering of the products, we can actually suppose that yk = 1. Then the above inequalities can be rewritten as a set of p - 1 constraints:

```
w (x1 - xb ) 0, 2 b p. k k (5)
```

Eq. (5) shows that the conjoint estimation problem can be cast as a classification problem in the product-profiles differences space. From this point of view, it seems quite natural to use state-of-the-art classifiers such as SVMs for this purpose.

```
4which is consistent with the L2-loss measuring deviations of wi-s from .
```

More specifically, we propose to train a L2-soft margin classifier (see also [3] for a similar approach) with only positive examples and with a hyperplane passing through the origin (no bias), modelling the noise in the answers with slack variables kb:

```
Minimize w2 + C q p 2 k=1 b=2 kb subject to w (x1 - xb ) 1 - k k kb.
```

3.2 Estimation of individual utilities

It was proposed in [3] to train one SVM per consumer to get wi and to compute the individual partworths by regularizing with the aggregated partworths w = 1 n w n i=1 i: w = wi+w . i 2

Instead, to estimate the individual utility partworths wi, we suggest the following opti- mization problem (the set Qi contains the indices j such that the consumer i was asked to choose between products x1 , . . . , xp) : k k

```
~ Minimize w2 + C p 2 + C p 2 i qi kQi b=2 kb q b=2 kb j=i j k / Qi subject to wi (x1 - xb ) 1 - k k kb, k, b 2 .
```

Here the ratio C determines the trade-off between the individual scale and the aggregated ~ C one.5 For C = 1, the population is modeled as if it were homogeneous, i.e. all partworths ~ C wi are equal. For C 1, the individual partworths are computed independently, without ~ C taking into account aggregated partworths.

4 Related work

Ordinal regression Very recently [2] explores the so-called ordinal regression task for ranking, and derive two techniques for hyperparameters learning and model selection in a hierarchical bayesian framework, Laplace approximation and Expectation Propagation respectively. Ordinal regression is similar yet distinct from conjoint estimation since train- ing data are supposed to be rankings or ratings in contrast with conjoint estimation where training data are choice-based. See [4] for more extensive bibliography.

Large margin classifiers Casting the preference problem in a classification framework, leading to learning by convex optimization, was known for a long time in the psycho- metrics community. [5] pioneered the use of large margin classifiers for ranking tasks. [3] introduced the kernel methods machinery for conjoint analysis on the individual scale. Very recently [10] proposes an alternate method for dealing with heterogeneity in conjoint analysis, which boils down to a very similar optimization to our HB-MAP approximation objective function, but with large margin regularization and with minimum deviation from the aggregated partworths.

Collaborative filtering Collaborative filtering exploits similarity between ratings across a population. The goal is to predict a person's rating on new products given the person's past ratings on similar products and the ratings of other people on all the products. Again collaborative is designed for overlapping training samples for each consumer, and usually rating/ranking training data, whereas conjoint estimation usually deals with different ques- tionnaires for each consumer and choice-based training data.

```
5C ~ C In this way, directions for which the xj , j Qi contain information are estimated accurately, whereas the others directions are estimated thanks to the answers of the other consumers.
```

5 Experiments

Artificial experiments We tested our algorithms on the same benchmarking artificial ex- perimental setup used in [3, 16]. The simulated product profiles consist of 4 attributes, each of them being discretized through 4 levels. A random design was used for the question- naire. For each question, the consumer was asked to choose one product from a basket of 4. A population of 100 consumers was simulated, each of them having to answer 4 questions. Finally, the results presented below are averaged over 5 trials.

The 100 true consumer partworths were generated from a Gaussian distribution with mean (-, -/3, /3, ) (for each attribute) and with a diagonal covariance matrix 2I. Each answer is a choice from the basket of products, sampled from the discrete logit-type distri- bution (1). Hence when (called the magnitude6) is large, the consumer will choose with high probability the product with the highest utility, whereas when is small, the answers will be less reliable. The ratio 2/ controls the heterogeneity7 of the population.

Finally, as in [3], the performances are computed using the mean of the L2 distances be- tween the true and estimated individual partworths (also called RMSE). Beforehand the partworths are translated such that the mean on each attribute is 0 and normalized to 1.

Real experiments We tested our algorithms on disguised industrial datasets kindly pro- vided by Sawtooth Software Inc., the world leading company in conjoint analysis soft- wares.

11 one-choice-based8 conjoint surveys datasets9 were used for real experiments below. The number of attributes ranged from 3 to 6 (hence total number of levels from 13 to 28), the size of the baskets, to pick one product from at each question, ranged from 2 to 5, and the number of questions ranged from 6 to 15. The numbers of respondents ranged roughly from 50 to 1200. Since here we did not address the issue of no choice options in question answering, we removed10 questions where customers refused to choose a product from the basket and chose the no-choice-option as an answer11.

Finally, as in [16], the performances are computed using the hit rate, i.e. the misprediction rate of the preferred product.

5.1 Analysis of HB-MAP

We compare in this section our implementation of the HB method described in Section 2, that we call HB-MAP, to HB-S, the standard HB implementation.

The average training time for HB-S was 19 minutes (with 12000 iterations as suggested in [11]), whereas our implementation based on the approximation of the MAP solution took in average only 1.8 seconds. So our primary goal, i.e. to alleviate the sampling phase complexity, was achieved since we got a speed-up factor of the order of 1000.

The accuracy does not seem to be significantly weakened by this new implementation. Indeed, as shown in both Table 1 and Table 2, the performances achieved by HB-MAP were surprisingly often as good as HB-S's, and sometimes even a bit better. This might be

```
6as in [3], we tested High Magnitude ( = 3) and Low Magnitude ( = 0.5). 7It was either set to 2 = 3 or 2 = 0.5, respectively High and Low Heterogeneity cases. 8We limited ourselves to datasets in which respondents were asked to choose 1 product among a basket at each question. 9see [4] for more details on the numerical features of the datasets. 10One could use EM-based methods to deal with such missing training choice data. 11When this procedure boiled down to unreasonable number of questions for hold-out evaluation of our algorithms, we simply removed the corresponding individuals.
```

explained by the fact that assuming that the covariance matrix is quasi-diagonal is a reason- able approximation, and that the mode of the posterior distribution is actually roughly close to the mean, for the real datasets considered. Additionally it is likely that HB-S may have demanded much more iterations for convergence to systematically behave more accurately than HB-MAP as one would have normally expected.

5.2 Analysis of SVMs

We now turn to the SVM approach presented in section 3.2 that we call Im.SV12. We did not use a non-linear kernel in our experiments. Hence it was possible to minimize (3.2) directly in the primal, instead of using the dual formulation as done usually. This turned out to be faster since the number of constraints was, for our problem, larger than the number of variables. The resulting mean training time was 4.7 seconds. The so-called chapspan, span estimate of leave-one-out prediction error [17], was used to select a suitable value of C13, since it gave a quasi-convex estimation on the regularization path.

The performances of Im.SV in Table 2, compared to the HB methods and logistic regression [3] are very satisfactory in case of artificial experiments. In real experiments, Im.SV gives overall quite satisfactory results, but sometimes disappointing ones in Table 2. One reason might be that hyperparameters (C, ~ C) were optimized once for the whole population. This may also be due to the lack of robustness14 of Im.SV to heterogeneity in the number of training samples for each consumer.

```
Table 1: Average RMSE between estimated and true individual partworths
Mag Het HB-S HB-MAP Logistic Im.SV L L 0.90 0.83 0.84 0.86 L H 0.95 0.91 1.16 0.90 H L 0.44 0.40 0.43 0.41 H H 0.72 0.68 0.82 0.67
Table 2: Hit rate performances on real datasets.
Im.SV HB-MAP HB-S Im.SV HB-MAP HB-S Dat12 0.16 0.16 0.17 Dat15 0.52 0.45 0.48 Dat22 0.15 0.13 0.15 Dat25 0.58 0.47 0.51 Im.SV HB-MAP HB-S Im.SV HB-MAP HB-S Dat13 0.37 0.24 0.25 Dat1 Dat2 4 0.33 0.36 0.35 3 0.34 0.33 0.33 Dat2 Dat3 4 0.33 0.36 0.28 3 0.35 0.28 0.24 Dat3 Dat4 4 0.45 0.40 0.25 3 0.35 0.31 0.28
```

Legend of Tables 1 and 2 The first two columns indicate the Magnitude and the Heterogeneity (High or Low). p in Datmp is the number of products respondents are asked to choose one from at each question.

12since individual choice data are Immersed in the rest of the population choice data, via the optimization objective 13We observed that the value of the constant ~ C was irrelevant, and that only the ratio C/ ~ C mat- tered. 14Indeed the no-choice data cleaning step might have lead to a strong unbalance to which Im.SV is maybe much more sensitive than HB-MAP or HB-S.

6 Active learning

Motivation Traditional experimental designs are built by minimizing the variance of an estimator (e.g. orthogonal designs [6]). However, they are sub-optimal because they do not take into account the previous answers of the consumer. Therefore adaptive conjoint analysis was proposed [11, 16] for adaptively designing questionnaires.

The adaptive design concept is often called active learning in machine learning, as the algorithm can actively select questions whose responses are likely to be informative. In the SVM context, a common and intuitive strategy is to select, as the next point to be labeled, the nearest one from the decision boundary (see for instance [15]).

Experiments We implemented this heuristic for conjoint analysis by selecting for each question a set of products whose estimated utilities are as close as possible15. To compare the different designs, we used the same artificial simulations as in section 5, but with 16 questions per consumer in order to fairly compare to the orthogonal design.

```
Table 3: Comparison of the RMSE achieved by different designs.
Mag Het Random Orthogonal Adaptive L L 0.66 0.61 0.66 L H 0.62 0.56 0.56 H L 0.31 0.29 0.24 H H 0.49 0.45 0.34
```

Results in Table 3 show that active learning produced an adaptive design which seems efficient, especially in the case of high magnitude, i.e. when the answers are not noisy16.

Do not remove: This comment is monitored to verify that the site is working properly