{"title": "Beyond Exchangeability: The Chinese Voting Process", "book": "Advances in Neural Information Processing Systems", "page_first": 4934, "page_last": 4942, "abstract": "Many online communities present user-contributed responses, such as reviews of products and answers to questions. User-provided helpfulness votes can highlight the most useful responses, but voting is a social process that can gain momentum based on the popularity of responses and the polarity of existing votes. We propose the Chinese Voting Process (CVP) which models the evolution of helpfulness votes as a self-reinforcing process dependent on position and presentation biases. We evaluate this model on Amazon product reviews and more than 80 StackExchange forums, measuring the intrinsic quality of individual responses and behavioral coefficients of different communities.", "full_text": "Beyond Exchangeability: The Chinese Voting Process\n\nMoontae Lee\nDept. of Computer Science\nCornell University\nIthaca, NY 14853\nmoontae@cs.cornell.edu\n\nSeok Hyun Jin\nDept. of Computer Science\nCornell University\nIthaca, NY 14853\nsj372@cornell.edu\n\nDavid Mimno\nDept. of Information Science\nCornell University\nIthaca, NY 14853\nmimno@cornell.edu\n\nAbstract\n\nMany online communities present user-contributed responses such as reviews of\nproducts and answers to questions. User-provided helpfulness votes can highlight\nthe most useful responses, but voting is a social process that can gain momentum\nbased on the popularity of responses and the polarity of existing votes. We propose\nthe Chinese Voting Process (CVP) which models the evolution of helpfulness votes\nas a self-reinforcing process dependent on position and presentation biases. 
We\nevaluate this model on Amazon product reviews and more than 80 StackExchange\nforums, measuring the intrinsic quality of individual responses and behavioral\ncoefficients of different communities.\n\n1 Introduction\n\nWith the expansion of online social platforms, user-generated content has become increasingly\ninfluential. Customer reviews on e-commerce sites like Amazon are often more helpful than editorial\nreviews [14], and answers in Q&A forums such as StackOverflow and MathOverflow are\nhighly useful for coders and researchers [9, 18]. Due to the diversity and abundance of user content,\npromoting better access to more useful information is critical for both users and service providers.\nHelpfulness voting is a powerful means to evaluate the quality of user responses (i.e., reviews/answers)\nby the wisdom of crowds. While these votes are generally valuable in aggregate, estimating the\ntrue quality of the responses is difficult because users are heavily influenced by previous votes. We\npropose a new model that is capable of learning the intrinsic quality of responses by considering their\nsocial contexts and momentum.\nPrevious work in self-reinforcing social behaviors shows that although inherent quality is an important\nfactor in overall ranking, users are susceptible to position bias [12, 13]. Displaying items in an order\naffects users: top-ranked items get more popularity, while low-ranked items remain in obscurity.\nWe find that sensitivity to order also differs across communities: some value a range of opinions,\nwhile others prefer a single authoritative answer. Summary information displayed together can\nlead to presentation bias [19]. As the current voting scores are visibly presented with responses,\nusers inevitably perceive the score before reading the contents of responses. Such exposure could\nimmediately nudge user evaluations toward the majority opinion, making high-scored responses more\nattractive. 
We also find that the relative length of each response affects the polarity of future votes.\nStandard discrete models for self-reinforcing processes include the Chinese Restaurant Process and\nthe P\u00f3lya urn model. Since these models are exchangeable, the order of events does not affect\nthe probability of a sequence. However, Table 1 suggests how different contexts of votes cause\ndifferent impacts. While the four sequences have equal numbers of positive and\nnegative votes in aggregate, the fourth votes in the first and last responses are given against a clear\nmajority opinion. Our model treats objection as a more challenging decision, thereby deserving\nhigher weight. In contrast, the middle two sequences receive alternating votes. As each vote is\na relatively weaker disagreement, the underlying quality is moderate compared to the other two\nresponses. Furthermore, if these are responses to one item, the order between them also matters. If\nthe initial three votes on the fourth response pushed its display position to the next page, for example,\nit might not have a chance to get future votes that could recover its reputation.\n\nRes  Votes        Diff  Ratio  Relative Quality\n1    + + + - - -  0     0.5    quite negative\n2    + - + - + -  0     0.5    moderately negative\n3    - + - + - +  0     0.5    moderately positive\n4    - - - + + +  0     0.5    quite positive\n\nTable 1: Quality interpretation for each sequence of six votes.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe Chinese Voting Process (CVP) models generation of responses and votes, formalizing the evolution\nof helpfulness under positional and presentational reinforcement. 
Whereas most previous work\non helpfulness prediction [7, 5, 8, 4, 11, 10, 15] has involved a single snapshot, the CVP estimates\nintrinsic quality of responses solely from selection and voting trajectories over multiple snapshots.\nThe resulting model shows significant improvements in predictive probability for helpfulness votes,\nespecially in the critical early stages of a trajectory. We find that the CVP-estimated intrinsic quality\nranks responses better than the existing system rank, correlating more coherently with the sentiment of comments\nassociated with each response. Finally, we qualitatively compare different characteristics of self-reinforcing\nbehavior between communities using two learned coefficients: Trendiness and Conformity.\nThe two-dimensional embedding in Figure 1 characterizes different opinion dynamics from Judaism\nto Javascript (in StackOverflow).\n\nFigure 1: 2D Community embedding. Each of 83 communities is represented by two behavioral coefficients\n(Trendiness, Conformity). Eleven clusters are grouped based on their common focus. The MEAN community is\nsynthesized by sampling 20 questions from every community (except Amazon due to the different user interface).\n\nRelated work. There is strong evidence that helpfulness voting is socially influenced. Helpfulness\nratings on Amazon product reviews differ significantly from those of independent human annotators [8]. Votes\nare generally more positive, and the number of votes decreases exponentially based on displayed\npage position. Review polarity is biased towards matching the consensus opinion [4]: when two\nreviews contain essentially the same text but differ in star rating, the review closer to the consensus\nstar rating is considered more helpful. There is also evidence that users vote strategically to correct\nperceived mismatches in review rank [16]. Many studies have attempted to predict helpfulness given\nreview-content features [7, 5, 11, 10, 15]. 
Each of these examples predicts helpfulness based on text,\nstar-ratings, sales, and badges, but only at a single snapshot. Our work differs in two ways. First, we\ncombine data on Amazon helpfulness votes from [16] with a much larger collection of helpfulness\nvotes from 82 StackExchange forums. Second, instead of considering text-based features (which we\nhold out for evaluation) within a single snapshot, we attempt to predict the next vote at each stage\nbased on the previous voting trajectory over multiple snapshots without considering textual contents.\n\n2 The Chinese Voting Process\n\nOur goal is to model helpfulness voting as a two-phase self-reinforcing stochastic process. In the\nselection phase, each user either selects an existing response based on its position or writes a new\nresponse. The positional reinforcement is inspired by the Chinese Restaurant Process (CRP) and\nDistance Dependent Chinese Restaurant Process (ddCRP). In the voting phase, when one response is\nselected, the user chooses one of the two feedback options: a positive or negative vote based on the\nintrinsic quality and the presentational factors. The presentational reinforcement is modeled by a\nlog-linear model with time-varying features based on the P\u00f3lya urn model. The CVP implements\nthe-rich-get-richer dynamics as an interplay of these two preferential reinforcements, learning latent\nqualities of individual responses as motivated by Table 1. Specifically, each user at time t interested in\nthe item i follows the generative story in Table 2.\n\nGenerative process\n1. Evaluate j-th response: p(z_i^{(t)} = j | z_i^{(1:t-1)}; \u03b1) \u221d f_i^{(t-1)}(j)\n   (a) \u2018Yes\u2019: p(v_i^{(t)} = 1 | \u03b8) = logit^{-1}(q_{ij} + g_i^{(t-1)}(j))\n   (b) \u2018No\u2019 : p(v_i^{(t)} = 0 | \u03b8) = 1 - p(v_i^{(t)} = 1 | \u03b8)\n2. Or write a new response: p(z_i^{(t)} = J_i + 1 | z_i^{(1:t-1)}; \u03b1) \u221d \u03b1\n   (a) Sample q_{i(J+1)} from N(0, \u03c3^2).\n\nSample parametrization (Amazon)\nf_i^{(t)}(j) = (1 / (1 + the-display-rank_i^{(t)}(j)))^\u03c4\ng_i^{(t)}(j) = \u03bb r_{ij}^{(t)} + \u03bc s_{ij}^{(t)} + \u03bd_i u_{ij}^{(t)}\n\u03b8 = {{q_{ij}}, \u03bb, \u03bc, {\u03bd_i}}\nJ_i = J_i^{(t-1)} (abbreviated notation)\n\nTable 2: The generative story and the parametrization of the Chinese Voting Process (CVP).\n\n2.1 Selection phase\n\nThe CRP [1, 2] is a self-reinforcing decision process over an infinite discrete set. For each item\n(product/question) i, the first user writes a new response (review/answer). The t-th subsequent user\ncan choose an existing response j out of J_i^{(t-1)} possible responses with probability proportional to\nthe number of votes n_j^{(t-1)} given to the response j by time t - 1, whereas the probability of writing a\nnew response J_i^{(t-1)} + 1 is proportional to a constant \u03b1. While the CRP models self-reinforcement\n(each vote for a response makes that response more likely to be selected later), there is evidence\nthat the actual selection rate in an ordered list decays with display rank [6]. Since such rankings are\nmechanism-specific and not always clearly known in advance, we need a more flexible model that\ncan specify various degrees of positional preference. The ddCRP [3] introduces a function f that\ndecays with respect to some distance measure. In our formulation, the distance function varies over\ntime and is further configurable with respect to the specific interface of service providers.\nSpecifically, the function f_i^{(t)}(j) in the CVP evaluates the popularity of the j-th response in the item i\nat time t. Since we assume that popularity of responses is decided by their positional accessibility,\nwe can parametrize f to be inversely proportional to their display ranks. 
The exponent \u03c4 determines\nsensitivity to popularity in the selection phase by controlling the degree of harmonic penalization\nover ranks. Larger \u03c4 > 0 indicates that users are more sensitive to trendy responses displayed near\nthe top. If \u03c4 < 0, users often select low-ranked responses over high-ranked ones for some reason.1\nNote that even if the user at time t does not vote on the j-th response, f_i^{(t)}(j) could be different from\nf_i^{(t-1)}(j) in the CVP,2 whereas n_{ij}^{(t)} = n_{ij}^{(t-1)} in the CRP. Thus one can view the selection phase of the\nCVP as a non-exchangeable extension of the CRP via a time-varying function f.\n\n2.2 Voting phase\n\nWe next construct a self-reinforcing process for the inner voting phase. The P\u00f3lya urn model is\na self-reinforcing decision process over a finite discrete set, but because it is exchangeable, it is\nunable to capture contextual information encoded in each sequence of votes. We instead use\na log-linear formulation with the urn-based features, allowing other presentational features to be\nflexibly incorporated based on the modeler\u2019s observations.\nEach response initially has x = x^{(0)} positive and y = y^{(0)} negative votes, which could be fractional\npseudo-votes. For each draw of a vote, we return w + 1 votes with the same polarity, thus self-reinforcing\nwhen w > 0. Table 3 shows the time-evolving positive/negative ratios\nr_j^{(t)} = x_j^{(t)}/(x_j^{(t)} + y_j^{(t)}) and s_j^{(t)} = y_j^{(t)}/(x_j^{(t)} + y_j^{(t)}) of the first two responses, j \u2208 {1, 2}, in Table 1 with\nthe corresponding ratio gain \u0394_j^{(t)} = r_j^{(t)} - r_j^{(t-1)} (if v_j^{(t)} = 1 or +) or s_j^{(t)} - s_j^{(t-1)} (if v_j^{(t)} = 0 or -).\n\n1 This sometimes happens especially in the early stage when only a few responses exist.\n2 Say the rank of another response j\u2032 was lower than j\u2019s at time t - 1. If the t-th vote given to the response j\u2032\nraises its rank higher than the rank of the response j, then f_i^{(t)}(j) < f_i^{(t-1)}(j) assuming \u03c4 > 0.\n\nt or T  v_1^{(t)}  r_1^{(t)}  s_1^{(t)}  \u0394_1^{(t)}   q_1^T  v_2^{(t)}  r_2^{(t)}  s_2^{(t)}  \u0394_2^{(t)}   q_2^T\n0                  1/2        1/2                                      1/2        1/2\n1       +          2/3        1/3        0.167              +          2/3        1/3        0.167\n2       +          3/4        1/4        0.083       0.363  -          2/4        2/4        0.167      -0.363\n3       +          4/5        1/5        0.050       0.574  +          3/5        2/5        0.100       0.004\n4       -          4/6        2/6        0.133       0.237  -          3/6        3/6        0.100      -0.230\n5       -          4/7        3/7        0.095       0.004  +          4/7        3/7        0.071       0.007\n6       -          4/8        4/8        0.071      -0.175  -          4/8        4/8        0.071      -0.166\n\nTable 3: Change of quality estimation q_j over time for the first two example responses in Table 1 with the initial\npseudo-votes (x, y, w) = (1, 1, 1). The estimated quality of the first response sharply decreases when receiving\nthe first majority-against vote at t = 4. The first response ends up being more negative than the second, even if\nthey receive the same number of votes in aggregate. These non-exchangeable behaviors cannot be modeled with\na simple exchangeable process.\n\nIn this toy setting, the polarity of a vote to a response is an outcome of its intrinsic quality as well\nas presentational factors: positive and negative votes. Thus we model each sequence of votes by\n\u2113_2-regularized logistic regression with the latent intrinsic quality and the P\u00f3lya urn ratios.3\n\n  \\max_{\\theta} \\log \\prod_{t=2}^{T} \\operatorname{logit}^{-1}\\left( q_j^T + \\lambda r_j^{(t-1)} + \\mu s_j^{(t-1)} \\right) - \\frac{1}{2} \\|\\theta\\|_2^2 \\quad \\text{where } \\theta = \\{ q_j^T, \\lambda, \\mu \\} \\qquad (1)\n\nThe {q_j^T} in Table 3 show the result from solving (1) up to the T-th vote for each j \u2208 {1, 2}. The\ninitial vote given at t = 1 is disregarded in the training due to its arbitrariness from the uniform\nprior (x_0 = y_0). Since it is quite possible to have only positive or only negative votes, Gaussian\nregularization is necessary. Note that using the urn-based ratio features is essential to encode\ncontextual information. 
If we instead use raw count features (only the numerators of r_j and s_j), for\nexample in the first response, the estimated quality q_1^T keeps increasing even after getting negative\nvotes from time 4 to 6. Log raw count features are also unable to infer the negative quality.\nIn the first response, \u0394_1^{(t)} shows the decreasing gain in positive ratios from t = 1 to 3 and in negative\nratios from t = 4 to 6, whereas it gains a relatively large momentum at the first negative vote when\nt = 4. \u0394_2^{(t)} converges to 0 in the 2nd response, implying that future votes have less effect than earlier\nvotes for alternating +/- votes. q_2^T also converges to 0 as we expect neutral quality in the limit.\nOverall the model is capable of learning intrinsic quality as desired in Table 1 where relative gains\ncan be further controlled by tuning the initial pseudo-votes (x, y).\nIn the real setting, the polarity score function g_i^{(t)}(j) in the CVP evaluates presentational factors of\nthe j-th response in the item i at time t. Because we adopt a log-linear formulation, one can easily\nadd additional information about responses. In addition to the positive ratio r_{ij}^{(t)} and the negative\nratio s_{ij}^{(t)}, g also contains a length feature u_{ij}^{(t)} (as given in Table 2), which is the relative length of the\nresponse j against the average length of responses in the item i at particular time t. Users in some\nitems may prefer shorter responses than longer ones for brevity, whereas users in other items may\nblindly believe that longer responses are more credible before reading their contents. The parameter\n\u03bd_i explains length-wise preferential idiosyncrasy as a per-item bias: \u03bd_i < 0 means a preference toward\nthe shorter responses. Note that g_i^{(t)}(j) could be different from g_i^{(t-1)}(j) even if the user at time t does\nnot choose to vote.4 All together, the voting phase of the CVP generates non-exchangeable votes.\n\n3 Inference\n\nEach phase of the CVP depends on the result of all previous stages, so decoupling these related\nproblems is crucial for efficient inference. We need to estimate community-level parameters, item-\nlevel length preferences, and response-level intrinsic qualities. The graphical model of the CVP and\ncorresponding parameters to estimate are illustrated in Table 4. We further compute two community-\nlevel behavioral coefficients: Trendiness and Conformity, which are useful summary statistics for\nexploring different voting patterns and explaining macro characteristics across different communities.\n\n3 One might think (1) can be equivalently achievable with only two parameters because of r_j^{(t)} + s_j^{(t)} = 1 for\nall t. However, such reparametrization adds inconsistent translations to q_j^T and makes it difficult to interpret\ndifferent inclinations between positive and negative votes for various communities.\n4 If a new response is written at time t, u_{ij}^{(t)} \u2260 u_{ij}^{(t-1)} as the new response changes the average length.\n\n\u03b1: hyper-parameter for response growth\n\u03c3^2: hyper-parameter for quality variance\n\u03c4: community-level sensitivity to popularity\n\u03bb: community-level preference for positive ratio\n\u03bc: community-level preference for negative ratio\n\u03bd_i: item-level preference for response length\nq_{ij}: response-level hidden intrinsic quality\nm: # of items (e.g., products/questions)\nJ_i: # of responses of item i (e.g., reviews/answers)\n\nTable 4: Graphical model and parameters for the CVP. Only three time steps are unrolled for visualization.\n\nParameter inference. The goal is to infer parameters \u03b8 = {{q_{ij}}, \u03bb, \u03bc, {\u03bd_i}}. We sometimes use f\nand g instead to compactly indicate parameters associated to each function. The likelihood of one\nCVP step in the item i at time t is\n\n  L_i^{(t)}(\\tau, \\theta; \\alpha, \\sigma) = \\left\\{ \\frac{\\alpha}{\\alpha + \\sum_{j=1}^{J_i^{(t-1)}} f_i^{(t-1)}(j)} \\, \\mathcal{N}(q_{i,z_i^{(t)}}; 0, \\sigma^2) \\right\\}^{\\mathbb{1}[z_i^{(t)} = J_i^{(t-1)} + 1]} \\left\\{ \\frac{f_i^{(t-1)}(z_i^{(t)})}{\\alpha + \\sum_{j=1}^{J_i^{(t-1)}} f_i^{(t-1)}(j)} \\, p(v_i^{(t)} \\mid q_{i,z_i^{(t)}}, g_i^{(t-1)}(z_i^{(t)})) \\right\\}^{\\mathbb{1}[z_i^{(t)} \\le J_i^{(t-1)}]}\n\nwhere the two terms correspond to writing a new response and selecting an existing response to\nvote. The fractions in each term respectively indicate the probability of writing a new response and\nchoosing existing responses in the selection phase. The other two probability expressions in each term\ndescribe quality sampling from a normal distribution and the logistic regression in the voting phase.\nIt is important to note that our setting differs from many CRP-based models. The CRP is typically\nused to represent a non-parametric prior over the choice of latent cluster assignments that must\nthemselves be inferred from noisy observations. In our case, the result of each choice is directly\nobservable because we have the complete trajectory of helpfulness votes. As a result, we only\nneed to infer the continuous parameters of the process, and not combinatorial configurations of\ndiscrete variables. Since we know the complete trajectory where the rank inside the function f is a\npart of the true observations, we can view each vote as an independent sample. 
Denoting the last\ntimestamp of the item i by T_i, the log-likelihood becomes \u2113(\u03c4, \u03b8; \u03b1, \u03c3) = \u2211_{i=1}^{m} \u2211_{t=1}^{T_i} log L_i^{(t)} and is\nfurther separated into two pieces:\n\n  \\ell_v(\\theta; \\sigma) = \\sum_{i=1}^{m} \\sum_{t=1}^{T_i} \\left\\{ \\mathbb{1}[\\text{write}] \\cdot \\log \\mathcal{N}(q_{i,z_i^{(t)}}; 0, \\sigma^2) + \\mathbb{1}[\\text{choose}] \\cdot \\log p(v_i^{(t)} \\mid q_{i,z_i^{(t)}}, g_i^{(t-1)}(z_i^{(t)})) \\right\\},\n\n  \\ell_s(\\tau; \\alpha) = \\sum_{i=1}^{m} \\sum_{t=1}^{T_i} \\left\\{ \\mathbb{1}[\\text{write}] \\cdot \\log \\frac{\\alpha}{\\alpha + \\sum_{j=1}^{J_i^{(t-1)}} f_i^{(t-1)}(j)} + \\mathbb{1}[\\text{choose}] \\cdot \\log \\frac{f_i^{(t-1)}(z_i^{(t)})}{\\alpha + \\sum_{j=1}^{J_i^{(t-1)}} f_i^{(t-1)}(j)} \\right\\}. \\qquad (2)\n\nInferring a whole trajectory based only on the final snapshots would likely be intractable for a\nnon-exchangeable model. Due to the continuous interaction between f and g for every time step,\nsmall mis-predictions in the earlier stages will cause entirely different configurations. Moreover the\nrank function inside f is in many cases site-specific.5 It is therefore vital to observe all trajectories\nof random variables {z_i^{(t)}, v_i^{(t)}}: decoupling f and g reduces the inference problem into estimating\nparameters separately for the selection phase and the voting phase. Maximizing \u2113_v can be efficiently\nsolved by \u2113_2-regularized logistic regression as demonstrated for (1). If the hyper-parameter \u03b1 is fixed,\nmaximizing \u2113_s becomes a convex optimization because \u03c4 appears in both the numerator and the\ndenominator. Since the gradient for each parameter in \u03b8 is obvious, we only include the gradient of\n\u2113_{s,i}^{(t)} for the particular item i at time t with respect to \u03c4. Then \u2202\u2113_s/\u2202\u03c4 = \u2211_{i=1}^{m} \u2211_{t=1}^{T_i} \u2202\u2113_{s,i}^{(t)}/\u2202\u03c4.\n\n  \\frac{\\partial \\ell_{s,i}^{(t)}}{\\partial \\tau} = \\frac{1}{\\tau} \\left\\{ \\mathbb{1}[z_i^{(t)} \\le J_i^{(t-1)}] \\cdot \\frac{f_i^{(t-1)}(z_i^{(t)}) \\log f_i^{(t-1)}(z_i^{(t)})}{f_i^{(t-1)}(z_i^{(t)})} - \\frac{\\sum_{j=1}^{J_i^{(t-1)}} f_i^{(t-1)}(j) \\log f_i^{(t-1)}(j)}{\\alpha + \\sum_{j=1}^{J_i^{(t-1)}} f_i^{(t-1)}(j)} \\right\\} \\qquad (3)\n\n5 We generally know that Amazon decides the display order by the portion of positive votes and the total\nnumber of votes on each response, but the relative weights between them are not known. We do not know how\nStackExchange forums break ties, which has a large effect in the early stages of voting.\n\n                  Selection     Voting                                    Residual    Bumpiness\nCommunity         CRP    CVP    q     \u03bb     \u03bd     q,\u03bb   q,\u03bd   \u03bb,\u03bd   Full  Rank  Qual  Rank  Qual\nSOF(22925)        2.152  1.989  .107  .103  .108  .100  .106  .100  .096  .005  .003  .080  .038\nmath(6245)        1.841  1.876  .071  .064  .067  .062  .066  .060  .059  .014  .008  .280  .139\nenglish(5242)     1.969  1.924  .160  .146  .152  .141  .147  .137  .135  .018  .007  .285  .149\nmathOF(2255)      1.992  1.910  .049  .046  .049  .045  .047  .046  .045  .009  .007  .185  .119\nphysics(1288)     1.824  1.801  .174  .155  .166  .150  .156  .146  .142  .032  .014  .497  .273\nstats(598)        1.889  1.822  .051  .044  .048  .043  .046  .042  .042  .030  .019  .613  .347\njudaism(504)      2.039  1.859  .135  .124  .132  .121  .125  .118  .116  .046  .018  .875  .403\namazon(363)       2.597  2.261  .266  .270  .262  .254  .243  .253  .240  .023  .016  .392  .345\nmeta.SOF(294)     1.411  1.575  .261  .241  .270  .229  .243  .232  .225  .018  .013  .281  .255\ncstheory(279)     1.893  1.795  .052  .040  .053  .039  .049  .039  .038  .032  .029  .485  .553\ncs(123)           1.825  1.780  .128  .100  .118  .099  .113  .097  .096  .069  .040  .725  .673\nlinguistics(107)  1.993  1.789  .133  .127  .130  .122  .123  .120  .116  .074  .038  .778  .656\nAVERAGE           2.050  1.945  .109  .103  .108  .099  .105  .098  .095  .011  .006  .186  .101\n\n(Voting columns are labeled by the features included in the model, with q = q_{ij} and \u03bd = \u03bd_i.)\n\nTable 5: Predictive analysis on the first 50 votes: In the selection phase, the CVP shows better 
negative\nlog-likelihood in almost all forums. In the voting phase, the full model shows better negative log-likelihood\nthan all subsets of features. Quality analysis at the final snapshot: Smaller residuals and bumpiness show that\nthe order based on the estimated quality q_{ij} more coherently correlates with the average sentiments of the\nassociated comments than the order by display rank. (SOF=StackOverflow, OF=Overflow, rest=Exchange, Blue:\np \u2264 0.001, Green: p \u2264 0.01, Red: p \u2264 0.05)\n\nBehavioral coefficients. To succinctly measure overall voting behaviors across different communities,\nwe propose two community-level coefficients. Trendiness indicates the sensitivity to positional\npopularity in the selection phase. While we use the community-level parameter \u03c4 directly as Trendiness\nto avoid overly complicated models, one can easily extend the CVP to have per-item \u03c4_i to better\nfit the data. In that case, Trendiness would be a summary statistic for {\u03c4_i}. Conformity captures\nusers\u2019 receptiveness to prevailing polarity in the voting phase. To count every single vote, we define\nConformity to be a geometric mean of odds ratios between majority-following votes and majority-\ndisagreeing votes. Let V_i be the set of time steps when users vote rather than writing responses in the\nitem i. Say n is the total number of votes across all items in the target community. Then Conformity\nis defined as\n\n  \\kappa = \\left\\{ \\prod_{i=1}^{m} \\prod_{t \\in V_i} \\left( \\frac{P(v_i^{(t+1)} = 1 \\mid q^t_{i,z_i^{(t+1)}}, \\lambda^t, \\mu^t, \\nu_i^t)}{P(v_i^{(t+1)} = 0 \\mid q^t_{i,z_i^{(t+1)}}, \\lambda^t, \\mu^t, \\nu_i^t)} \\right)^{h_i^{(t)}} \\right\\}^{1/n} \\quad \\text{where } h_i^{(t)} = \\begin{cases} 1 & (n_{ij}^{+(t)} \\ge n_{ij}^{-(t)}) \\\\ -1 & (n_{ij}^{+(t)} < n_{ij}^{-(t)}) \\end{cases}\n\nTo compute Conformity \u03ba, we need to learn \u03b8^t = {q_{ij}^t, \u03bb^t, \u03bc^t, \u03bd_i^t} for each t, which is a set of\nparameters learned on the data only up to the time t. This is because the user at time t cannot see\nany future votes that will be given later than the time t. Note that \u03b8^{t+1} can be efficiently learned by\nwarm-starting at \u03b8^t. In addition, while positive votes are mostly dominant in the end, the dominant\nmood up to time t could be negative, exactly when the user at time t + 1 tries to vote. In this case,\nh_i^{(t)} becomes -1, inverting the fraction to be the ratio of following the majority against the minority.\nBy summarizing learned parameters in terms of two coefficients (\u03c4, \u03ba), we can compare different\nselection/voting behaviors for various communities.\n\n4 Experiments\n\nWe evaluate the CVP on product reviews from Amazon and 82 issue-specific forums from the\nStackExchange network. The Amazon dataset [16] originally consisted of 595 products with daily\nsnapshots of writing/voting trajectories from Oct 2012 to Mar 2013. After eliminating duplicate\nproducts6 and products with fewer than five reviews or fragmented trajectories,7 363 products are left.\nFor the StackExchange dataset8, we filter out questions from each community with fewer than five\nanswers besides the answer chosen by the question owner.9 We drop communities with fewer than\n100 questions after pre-processing. Many of these are \u201cMeta\u201d forums where users discuss policies\nand logistics for their original forums.\n\n6 Different seasons of the same TV shows have different ASIN codes but share the same reviews.\n7 If the number of total votes between the last snapshot of the early fragment and the first snapshot of the later\nfragment is less than 3, we fill in the missing information simply with the last snapshot of the earlier fragment.\n8 Dataset and statistics are available at https://archive.org/details/stackexchange.\n9 The answer selected by the question owner is displayed first regardless of voting scores.\n\nFigure 2: Comment and likelihood analysis on the StackOverflow forum. 
The left panels show that responses\nwith higher ranks tend to have more comments (top) and more positive sentiments (bottom). The middle\npanels show responses have more comments at both high and low intrinsic quality q_{ij} (top). The corresponding\nsentiment correlates more cohesively with the quality score (bottom). Each blue dot is approximately an average\nover 1k responses, and we parse 337k comments given on 104k responses in total. The right panels show\npredictive power for the selection phase (top) and the voting phase (bottom) up to t < 50 (lower is better).\n\nPredictive analysis. In each community, our prediction task is to learn the model up to time t and\npredict the action at t + 1. We align all items at their initial time steps and compute the average\nnegative log-likelihood of the next actions based on the current model. Since the complete trajectory\nenables us to separate the selection and voting phases in inference, we also measure the predictive\npower of these two tasks separately against their own baselines. For the selection phase, the baseline\nis the CRP, which selects responses proportional to the number of accumulated votes or writes a\nnew response with the probability proportional to \u03b1.10 When t < 50, as shown in the first column of\nTable 5, the CVP significantly outperforms the CRP based on paired t-tests (two-tailed). Using the\nfunction f based on display rank and Trendiness parameter \u03c4 is indeed a more precise representation\nof positional accessibility. Especially in the early stages, users often select responses displayed at\nlower ranks with fewer votes. While the CRP has no ability to give high scores in these cases, the\nCVP properly models it by decreasing \u03c4. 
The comparative advantage of the CVP declines as more\nvotes become available and the correlation between display rank and the number of votes increases.\nFor items with t \u2265 50, there is no significant difference between the two models as exemplified in the\nthird column of Figure 2. These results are coherent across other communities (p > 0.07).\nImproving predictive power on the voting phase is difficult because positive votes dominate in every\ncommunity. We compare the fully parametrized model to simpler partial models in which certain\nparameters are set to zero. For example, a model with all parameters but \u03bb knocked out is comparable\nto a plain P\u00f3lya Urn. As illustrated in the second column of Table 5, we verify that every sub-model\nis significantly different from the full model in all major communities based on a one-way ANOVA\ntest, implying that each feature adds distinctive and meaningful information. Having the item-specific\nlength bias \u03bd_i provides significant improvements as well as having intrinsic quality q_{ij} and current\nopinion counts \u03bb. While we omit the log-likelihood results with t \u2265 50, all models better predict true\npolarity when t \u2265 50, because the log-linear model obtains a more robust estimate of community-level\nparameters as the model acquires more training samples.\nQuality analysis. The primary advantage of the CVP is its ability to learn \u201cintrinsic quality\u201d for\neach response that filters out noise from self-reinforcing voting processes. We validate these scores\nby comparing them to another source of user feedback: both StackExchange and Amazon allow\nusers to attach comments to responses along with votes. For each response, we record the number\nof comments and the average sentiment of those comments as estimated by [17]. 
As a baseline, we\nalso calculate the final display rank of each response, which we convert to a z-score to make it more\ncomparable to the quality scores q_{ij}. After sorting responses based on display rank and quality rank,\nwe measure the association between the two rankings and comment sentiment with linear regression.\nResults are shown for StackOverflow in Figure 2. As expected, highly-ranked responses have more\ncomments, but we also find that there are more comments for both high and low values of intrinsic\nquality. Both better display rank and higher quality score q_{ij} are clearly associated with more positive\ncomments (slope \u2208 [0.47, 0.64]), but the residuals of quality rank (0.012) are on average less than\nhalf the residuals of display rank (0.028). In addition, we also calculate the \u201cbumpiness\u201d of these\nplots by computing the mean variation of two consecutive slopes between each adjacent pair of data\npoints. Quality rank reduces bumpiness of display rank from 0.391 to 0.226 on average, implying the\nestimated intrinsic quality yields a ranking that is locally as well as globally consistent.11\n\n10 We fix \u03b1 to 0.5 after searching over a wide range of values.\n\nCommunity analysis. The 2D embedding in Figure 1 shows that we can compare and\ncontrast the different evaluation cultures of communities using two inferred behavioral\ncoefficients: Trendiness \u03c4 and Conformity \u03ba. Communities are sized according to the number\nof items and colored based on a manual clustering. Related communities collocate\nin the same neighborhood. Religion, scholarship, and meta-discussions cluster towards\nthe bottom left, where users are interested in many different opinions, and are happy\nto disagree with each other. Going from left to right, communities become more trendy:\nusers in trendier communities tend to select and vote mostly on already highly-ranked\nresponses. 
Going from bottom to top, users become increasingly likely to conform to the\nmajority opinion on any given response. By comparing related communities we can observe that\ncharacteristics of user communities determine voting behavior more than technical similarity. Highly\ntheoretical and abstract communities (cstheory) have low Trendiness but high Conformity. More\napplied, but still graduate-level, communities in similar fields (cs, mathoverflow, stats) show less\nConformity but greater Trendiness. Finally, more practical homework-oriented forums (physics,\nmath) are even more trendy. In contrast, users in english are trendy and prone to debate. Users in Amazon\nare most sensitive to trendy reviews and least afraid of voicing minority opinions.\nStackOverflow is by far the largest community, and it is reasonable to wonder whether the Trendiness\nparameter is simply a proxy for size. When we subdivide StackOverflow by programming language,\nhowever (see Figure 3), individual community averages can be distinguished, but they all remain in\nthe same region. JavaScript programmers are more satisfied with trendy responses than those using\nC/C++. Mobile developers tend to be more conformist, while Perl hackers are more likely to argue.\n\nFigure 3: Sub-community embedding for StackOverflow.\n\n5 Conclusions\n\nHelpfulness voting is a powerful tool to evaluate user-generated responses such as product reviews\nand question answers. However, such votes can be socially reinforced by positional accessibility and\nexisting evaluations by other users. In contrast to many exchangeable random processes, the CVP\ntakes into account sequences of votes, assigning different weights based on the context in which each vote\nwas cast. 
Instead of trying to model the response ordering function f, which is mechanism-specific\nand often changes based on service providers\u2019 strategies, we leverage the fully observed trajectories of\nvotes, estimating the hidden intrinsic quality of each response and inferring two behavioral coefficients\nfor community-level exploration. The proposed log-linear urn model is capable of generating\nnon-exchangeable votes with great scalability to incorporate other factors such as length bias or other\ntextual features. As we are more able to observe social interactions as they are occurring and not just\nsummarized after the fact, we will increasingly be able to use models beyond exchangeability.\n\n11 All numbers and p-values in paragraphs are weighted averages on all 83 communities, whereas Table 5 only\nincludes results for the major communities and their own weighted averages due to space limits.\n\nReferences\n[1] D. J. Aldous. Exchangeability and related topics. In \u00c9cole d\u2019\u00c9t\u00e9 St Flour 1983, pages 1\u2013198.\nSpringer-Verlag, 1985.\n\n[2] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese\nrestaurant process. In Advances in Neural Information Processing Systems, NIPS \u201903, 2003.\n\n[3] D. M. Blei and P. I. Frazier. Distance dependent Chinese restaurant processes. Journal of Machine\nLearning Research, pages 2461\u20132488, 2011.\n\n[4] C. Danescu-Niculescu-Mizil, G. Kossinets, J. Kleinberg, and L. Lee. How opinions are received by online\ncommunities: A case study on Amazon.com helpfulness votes. In Proceedings of World Wide Web, WWW\n\u201909, pages 141\u2013150, 2009.\n\n[5] A. Ghose and P. G. Ipeirotis. Designing novel review ranking systems: Predicting the usefulness and\nimpact of reviews. In Proceedings of the Ninth International Conference on Electronic Commerce, ICEC\n\u201907, pages 303\u2013310, 2007.\n\n[6] T. Joachims, L. Granka, B. 
Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of\nimplicit feedback from clicks and query reformulations in web search. ACM Transactions on Information\nSystems, 25(2), 2007.\n\n[7] S.-M. Kim, P. Pantel, T. Chklovski, and M. Pennacchiotti. Automatically assessing review helpfulness. In\nProceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP \u201906,\n2006.\n\n[8] J. Liu, Y. Cao, C.-Y. Lin, Y. Huang, and M. Zhou. Low-quality product review detection in opinion\nsummarization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language\nProcessing and Computational Natural Language Learning, EMNLP-CoNLL \u201907, pages 334\u2013342, 2007.\n\n[9] L. Mamykina, B. Manoim, M. Mittal, G. Hripcsak, and B. Hartmann. Design lessons from the fastest Q&A\nsite in the west. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI\n\u201911, 2011.\n\n[10] L. Martin and P. Pu. Prediction of helpful reviews using emotion extraction. In Proceedings of the\nTwenty-Eighth AAAI Conference on Artificial Intelligence, AAAI \u201914, pages 1551\u20131557, 2014.\n\n[11] J. Otterbacher. \u2018Helpfulness\u2019 in online communities: A measure of message quality. In Proceedings of the\nSIGCHI Conference on Human Factors in Computing Systems, CHI \u201909, pages 955\u2013964, 2009.\n\n[12] M. J. Salganik, P. S. Dodds, and D. J. Watts. Experimental study of inequality and unpredictability in an\nartificial cultural market. Science, 311:854\u2013856, 2006.\n\n[13] M. J. Salganik and D. J. Watts. Leading the herd astray: An experimental study of self-fulfilling prophecies\nin an artificial cultural market. Social Psychology Quarterly, 71:338\u2013355, 2008.\n\n[14] W. Shandwick. Buy it, try it, rate it: Study of consumer electronics purchase decisions in the engagement\nera. KRC Research, 2012.\n\n[15] S. Siersdorfer, S. Chelaru, J. S. Pedro, I. S. 
Altingovde, and W. Nejdl. Analyzing and mining comments\nand comment ratings on the social web. ACM Transactions on the Web, pages 17:1\u201317:39, 2014.\n\n[16] R. Sipos, A. Ghosh, and T. Joachims. Was this review helpful to you?: It depends! Context and voting\npatterns in online content. In International Conference on World Wide Web, WWW \u201914, pages 337\u2013348,\n2014.\n\n[17] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep\nmodels for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference\non Empirical Methods in Natural Language Processing, EMNLP, pages 1631\u20131642. Association for\nComputational Linguistics, 2013.\n\n[18] Y. R. Tausczik, A. Kittur, and R. E. Kraut. Collaborative problem solving: A study of MathOverflow. In\nComputer-Supported Cooperative Work and Social Computing, CSCW \u201914, 2014.\n\n[19] Y. Yue, R. Patel, and H. Roehrig. Beyond position bias: Examining result attractiveness as a source of\npresentation bias in clickthrough data. In Proceedings of the 19th International Conference on World Wide\nWeb, WWW \u201910, 2010.\n", "award": [], "sourceid": 2495, "authors": [{"given_name": "Moontae", "family_name": "Lee", "institution": "Cornell University"}, {"given_name": "Seok Hyun", "family_name": "Jin", "institution": "Cornell University"}, {"given_name": "David", "family_name": "Mimno", "institution": "Cornell University"}]}