{"title": "Modeling Clutter Perception using Parametric Proto-object Partitioning", "book": "Advances in Neural Information Processing Systems", "page_first": 118, "page_last": 126, "abstract": "Visual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of merging superpixels by modeling a mixture of Weibull distributions on similarity distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of realistic scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman's $\\rho = 0.81$, $p < 0.05$), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features.", "full_text": "Modeling Clutter Perception using Parametric Proto-object Partitioning\n\nChen-Ping Yu\nDepartment of Computer Science\nStony Brook University\ncheyu@cs.stonybrook.edu\n\nWen-Yu Hua\nDepartment of Statistics\nPennsylvania State University\nwxh182@psu.edu\n\nDimitris Samaras\nDepartment of Computer Science\nStony Brook University\nsamaras@cs.stonybrook.edu\n\nGregory J. 
Zelinsky\nDepartment of Psychology\nStony Brook University\nGregory.Zelinsky@stonybrook.edu\n\nAbstract\n\nVisual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of clustering superpixels by modeling a mixture of Weibulls on Earth Mover\u2019s Distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of real world scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman\u2019s \u03c1 = 0.8038, p < 0.001), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features.\n\n1 Introduction\n\nVisual clutter, defined colloquially as a \u201cconfused collection\u201d or a \u201ccrowded disorderly state\u201d, is a dimension of image understanding that has implications for applications ranging from visualization and interface design to marketing and image aesthetics. 
In this study we apply methods from computer vision to quantify and predict human visual clutter perception.\nThe effects of visual clutter have been studied most extensively in the context of an object detection task, where models attempt to describe how increasing clutter negatively impacts the time taken to find a target object in an image [19][25][29][18][6]. Visual clutter has even been suggested as a surrogate measure for the set size effect, the finding that search performance often degrades with the number of objects in a scene [32]. Because human estimates of the number of objects in a scene are subjective and noisy - one person might consider a group of trees to be an object (a forest or a grove) while another person might label each tree in the same scene as an \u201cobject\u201d, or even each trunk or branch of every tree - it may be possible to capture this seminal search relationship in an objectively defined measure of visual clutter [21][25]. One of the earliest attempts to model visual clutter used edge density, i.e. the ratio of the number of edge pixels in an image to image size [19]. The subsequent feature congestion model ignited interest in clutter perception by estimating image complexity in terms of the density of intensity, color, and texture features in an image [25].\n\nFigure 1: How can we quantify set size or the number of objects in these scenes, and would this object count capture the perception of scene clutter?\n\nHowever, recent work has pointed out limitations of the feature congestion model [13][21], leading to the development of alternative approaches to quantifying visual clutter [25][5][29][18].\nOur approach is to model visual clutter in terms of proto-objects: regions of locally similar features that are believed to exist at an early stage of human visual processing [24]. Importantly, proto-objects are not objects, but rather the fragments from which objects are built. 
In this sense, our approach finds a middle ground between features and objects. Previous work used blob detectors to segment proto-objects from saliency maps for the purpose of quantifying shifts of visual attention [31], but this method is limited in that it results in elliptical proto-objects that do not capture the complexity or variability of shapes in natural scenes. Alternatively, it may be possible to apply standard image segmentation methods to the task of proto-object discovery. While we believe this approach has merit (see Section 4.3), it is also limited in that the goal of these methods is to approximate a human-segmented ground truth, where each segment generally corresponds to a complete and recognizable object. For example, in the Berkeley Segmentation Dataset [20] people were asked to segment each image into 2 to 20 equally important and distinguishable things, which results in many segments being actual objects. However, one rarely knows the number of objects in a scene, and ambiguity in what constitutes an object has even led some researchers to suggest that obtaining an object ground truth for natural scenes is an ill-posed problem [21].\nOur clutter perception model uses a parametric method of proto-object partitioning that clusters superpixels and requires no object ground truth. In summary, we create a graph with superpixels as nodes, then compute feature similarity distances between adjacent nodes. We use Earth Mover\u2019s Distance (EMD) [26] to perform pair-wise comparisons of feature histograms over all adjacent nodes, and model the EMD statistics with a mixture of Weibulls to solve an edge-labeling problem, which identifies and removes between-cluster edges to form isolated superpixel groups that are subsequently merged. We refer to these merged image fragments as proto-objects. 
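The edge-removal and merging step just described can be sketched in a few lines. The following is a minimal illustration in Python (not the paper's Matlab implementation); the edge list, threshold, and function names are toy values chosen for the example:

```python
# Minimal sketch of the edge-labeling step: remove edges whose
# similarity distance exceeds a threshold gamma, then count the
# resulting connected components (proto-objects) with union-find.
# The edge list and gamma below are illustrative toy values.

def count_proto_objects(num_superpixels, edges, gamma):
    parent = list(range(num_superpixels))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b, dist in edges:
        if dist < gamma:                   # within-cluster edge: merge
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
    return len({find(i) for i in range(num_superpixels)})

# Toy example: 4 superpixels in a chain with one dissimilar boundary.
edges = [(0, 1, 0.05), (1, 2, 0.07), (2, 3, 0.60)]
print(count_proto_objects(4, edges, gamma=0.2))  # 2 proto-objects
```

Dividing the resulting count by the initial number of superpixels gives the normalized clutter measure used later in the paper.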
Our approach is based on the novel finding that EMD statistics can be modeled by a Weibull distribution (Section 2.2), which allows us to model such similarity distance statistics with a mixture of Weibull distributions, resulting in extremely efficient and robust superpixel clustering in the context of our model. Our method runs in linear time with respect to the number of adjacent superpixel pairs, and has an end-to-end run time of 15-20 seconds for a typical 0.5 megapixel image, a size that many supervised segmentation methods cannot yet accommodate using desktop hardware [2][8][14][23][34].\n\n2 Proto-object partitioning\n\n2.1 Superpixel pre-processing and feature similarity\n\nTo merge similar fragments into a coherent proto-object region, the term fragment and the measure of coherence (similarity) must be defined. We define an image fragment as a group of pixels that share similar low-level image features: intensity, color, and orientation. This conforms with processing in the human visual system, and also makes a fragment analogous to an image superpixel, which is a perceptually meaningful atomic region that contains pixels similar in color and texture [30]. However, superpixel segmentation methods generally produce a fixed number of superpixels from an image, and groups of nearby superpixels may belong to the same proto-object due to the intended over-segmentation. Therefore, we extract superpixels as image fragments for pre-processing, and subsequently merge similar superpixels into proto-objects. 
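The adjacency structure that the merging step operates on can be recovered from any over-segmentation label map by scanning neighboring pixels. A minimal numpy sketch with a toy 4\u00d74 label map (illustrative, not the paper's code):

```python
import numpy as np

# Sketch: derive the adjacency graph of a superpixel label map by
# comparing horizontal and vertical pixel neighbors; label maps like
# this come from any over-segmentation method (e.g. SLIC).
def adjacent_pairs(labels):
    pairs = set()
    h = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    v = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.vstack([h, v]):
        if a != b:
            pairs.add((int(min(a, b)), int(max(a, b))))
    return sorted(pairs)

labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
print(adjacent_pairs(labels))  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```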
We define a pair of adjacent superpixels as belonging to the same coherent proto-object if they are similar in all three low-level image features. Thus we need to determine a similarity threshold for each of the three features that separates the similarity distance values into \u201csimilar\u201d and \u201cdissimilar\u201d populations, as detailed in Section 2.2.\nIn this work, the similarity statistics are based on comparing histograms of intensity, color, and orientation features from an image fragment. The intensity feature is a 1D 256-bin histogram, the color feature is a 76\u00d776 (8-bit color) 2D histogram using hue and saturation from the HSV colorspace, and the orientation feature is a symmetrical 1D 360-bin histogram using gradient orientations, similar to the HOG feature [10]. All three feature histograms are normalized to have the same total mass, such that bin counts sum to one.\nWe use Earth Mover\u2019s Distance (EMD) to compute the similarity distance between feature histograms [26], which is known to be robust to partially matching histograms. For any pair of adjacent superpixels $v_a$ and $v_b$, their normalized feature similarity distances for each of the intensity, color, and orientation features are computed as $x_{n;f} = EMD(v_{a;f}, v_{b;f}) / \\widehat{EMD}_f$, where $x_{n;f}$ denotes the similarity (0 is exactly the same, and 1 means completely opposite) between the $n$th pair ($n = 1, ..., N$) of nodes $v_a$ and $v_b$ under feature $f \\in \\{i, c, o\\}$ for intensity, color, and orientation. $\\widehat{EMD}_f$ is the maximum possible EMD for each of the three image features; it is well defined in this situation: the largest difference between intensities is black to white, between hues is 180\u25e6 apart, and between orientations is a horizontal gradient against a vertical gradient. Therefore, $\\widehat{EMD}_f$ normalizes $x_{n;f} \\in [0, 1]$. 
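For 1D histograms normalized to equal mass over unit-spaced bins, EMD reduces to the L1 distance between the cumulative histograms, which makes the normalization easy to illustrate. A minimal sketch with toy histograms; the maximum-EMD constant is assumed for the example, and the circular hue distance of the color feature is not handled here:

```python
import numpy as np

def emd_1d(p, q):
    """EMD between two normalized 1-D histograms with unit-spaced bins:
    the L1 distance between their cumulative sums (Mallows distance)."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

# Toy intensity histograms over 4 bins, each already summing to one.
p = np.array([1.0, 0.0, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.0, 1.0])
emd_max = 3.0            # assumed maximum: all mass moved across 3 bins
x = emd_1d(p, q) / emd_max
print(x)  # 1.0 -> completely dissimilar
```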
In the subsequent sections, we explain our proposed method for finding the adaptive similarity threshold from $x_f$, the EMDs of all pairs of adjacent nodes.\n\n2.2 EMD statistics and the Weibull distribution\n\nAny pair of adjacent superpixels is either similar enough to belong to the same proto-object, or belongs to different proto-objects, as separated by the adaptive similarity threshold $\\gamma_f$ that is different for every image. We formulate this as an edge-labeling problem: given a graph $G = (V, E)$, where $v_a \\in V$ and $v_b \\in V$ are two adjacent nodes (superpixels) joined by edge $e_{a,b} \\in E$, $a \\neq b$, also the $n$th edge of $G$, the task is to label the binary indicator variable $y_{n;f} = I(x_{n;f} < \\gamma_f)$ on edge $e_{a,b}$ such that $y_{n;f} = 1$ if $x_{n;f} < \\gamma_f$, which means $v_{a;f}$ and $v_{b;f}$ are similar (belong to the same proto-object); otherwise $y_{n;f} = 0$ and $v_{a;f}$ and $v_{b;f}$ are dissimilar (belong to different proto-objects). Once $\\gamma_f$ is computed, removing the edges with $y_f = 0$ results in isolated clusters of locally similar image patches, which are the desired proto-object groups.\nIntuitively, any pair of adjacent nodes is either within the same proto-object cluster or between different clusters ($y_{n;f} \\in \\{1, 0\\}$), therefore we consider two populations (the within-cluster edges and the between-cluster edges) to be modeled from the density of $x_f$ in a given image. In theory, this means that the density of $x_f$ is bi-modal, such that the left mode corresponds to the set of $x_f$ that are considered similar and coherent, while the right mode contains the set of $x_f$ that represent dissimilarity. At first thought, applying k-means with $k = 2$ or a mixture of two Gaussians would allow estimation of the two populations. However, there is no evidence showing that similarity distances follow symmetrical or normal distributions. 
In the following, we argue that the similarity distances $x_f$ computed by EMD follow a Weibull distribution, a member of the exponential family that is skewed in shape.\n\nWe define $EMD(P, Q) = (\\sum_{i,j} f'_{ij} d_{ij}) / (\\sum_{i,j} f'_{ij})$, with an optimal flow $f'_{ij}$ such that $\\sum_{i,j} f'_{ij} = \\min(\\sum_i p_i, \\sum_j q_j)$, $\\sum_j f'_{ij} \\leq p_i$, $\\sum_i f'_{ij} \\leq q_j$, and $f'_{ij} \\geq 0$, where $P = \\{(x_1, p_1), ..., (x_m, p_m)\\}$ and $Q = \\{(y_1, q_1), ..., (y_n, q_n)\\}$ are the two signatures to be compared, and $d_{ij}$ denotes a dissimilarity metric (i.e. $L_2$ distance) between $x_i$ and $y_j$ in $R^d$. When $P$ and $Q$ are normalized to have the same total mass, EMD becomes identical to the Mallows distance [17], defined as $M_p(X, Y) = (\\frac{1}{n} \\sum_{i=1}^{n} |x_i - y_i|^p)^{1/p}$, where $X$ and $Y$ are sorted vectors of the same size, and the Mallows distance is an $L_p$-norm based distance measurement. Furthermore, $L_p$-norm based distance metrics are Weibull distributed if the two feature vectors to be compared are correlated and non-identically distributed [7]. We show that our feature assumptions are satisfied in Section 4.1. Hence, we can model each feature of $x_f$ separately as a mixture of two Weibull distributions, and compute the corresponding $\\gamma_f$ as the boundary location between the two components of the mixture. 
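The EMD-Mallows equivalence for equal-mass inputs can be checked numerically: for two empirical samples of equal size, $M_1$ is the mean absolute difference of the sorted vectors, and the same value falls out of the |CDF difference| formulation. A sketch with illustrative sample values:

```python
import numpy as np

# Sketch: EMD between equal-mass distributions equals the Mallows
# distance. M1 on equal-size samples is the mean absolute difference
# of the sorted vectors; integrating |CDF1 - CDF2| gives the same value.
def mallows_1(x, y):
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def emd_from_cdfs(x, y, grid):
    cdf = lambda s, t: np.searchsorted(np.sort(s), t, side="right") / len(s)
    dx = grid[1] - grid[0]
    return np.sum(np.abs(cdf(x, grid) - cdf(y, grid))) * dx

x = np.array([0.1, 0.4, 0.2, 0.8])
y = np.array([0.3, 0.5, 0.9, 0.1])
grid = np.linspace(0.0, 1.0, 100001)
print(round(mallows_1(x, y), 3))            # 0.075
print(round(emd_from_cdfs(x, y, grid), 3))  # 0.075
```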
Although the Weibull distribution has been used in modeling actual image features such as texture and edges [12][35], it has not been used to model EMD similarity distance statistics until now.\n\n2.3 Weibull mixture model\n\nOur Weibull mixture model (WMM) takes the following general form:\n\n$W^K(x; \\theta) = \\sum_{k=1}^{K} \\pi_k \\phi_k(x; \\theta_k)$, $\\phi(x; \\alpha, \\beta, c) = \\frac{\\beta}{\\alpha} (\\frac{x - c}{\\alpha})^{\\beta - 1} e^{-(\\frac{x - c}{\\alpha})^{\\beta}}$ (1)\n\nwhere $\\theta_k = (\\alpha_k, \\beta_k, c_k)$ is the parameter vector for the $k$th mixture component, $\\phi$ denotes the three-parameter Weibull pdf with scale ($\\alpha$), shape ($\\beta$), and location ($c$) parameters, and the mixing parameters $\\pi_k$ satisfy $\\sum_k \\pi_k = 1$. In this case, our two-component WMM contains a 7-parameter vector $\\theta = (\\alpha_1, \\beta_1, c_1, \\alpha_2, \\beta_2, c_2, \\pi)$ that yields the following complete form:\n\n$W^2(x; \\theta) = \\pi \\frac{\\beta_1}{\\alpha_1} (\\frac{x - c_1}{\\alpha_1})^{\\beta_1 - 1} e^{-(\\frac{x - c_1}{\\alpha_1})^{\\beta_1}} + (1 - \\pi) \\frac{\\beta_2}{\\alpha_2} (\\frac{x - c_2}{\\alpha_2})^{\\beta_2 - 1} e^{-(\\frac{x - c_2}{\\alpha_2})^{\\beta_2}}$ (2)\n\nTo estimate the parameters of $W^2(x; \\theta)$, we tested two optimization methods: maximum likelihood estimation (MLE) and nonlinear least squares minimization (NLS). Both MLE and NLS require an initial parameter vector $\\theta'$ to begin the optimization, and the choice of $\\theta'$ is crucial to the convergence to the optimal parameter vector $\\hat{\\theta}$. 
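Eqs. (1)-(2) can be written directly as code. A minimal numpy sketch with illustrative parameter values, checking that the mixture integrates to one:

```python
import numpy as np

def weibull_pdf(x, alpha, beta, c):
    """Three-parameter Weibull pdf phi(x; alpha, beta, c) of Eq. (1)."""
    z = (x - c) / alpha
    out = np.zeros_like(x, dtype=float)
    m = z > 0                       # density is zero at or below location c
    out[m] = (beta / alpha) * z[m] ** (beta - 1) * np.exp(-z[m] ** beta)
    return out

def wmm2_pdf(x, pi, theta1, theta2):
    """Two-component mixture W2(x; theta) of Eq. (2)."""
    return pi * weibull_pdf(x, *theta1) + (1 - pi) * weibull_pdf(x, *theta2)

# Illustrative parameters: a 'similar' mode near 0 and a 'dissimilar'
# mode shifted by a location parameter, on the normalized range [0, 1].
x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]
density = wmm2_pdf(x, 0.6, (0.08, 1.5, 0.0), (0.15, 2.5, 0.3))
print(round(density.sum() * dx, 2))  # 1.0: integrates to one on [0, 1]
```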
In our case, the initial guess is quite well defined: for any node of a specific feature $v_{j;f}$ and its set of adjacent neighbors $v^N_{j;f} = N(v_{j;f})$, the neighbor most similar to $v_{j;f}$ is the one most likely to belong to the same cluster as $v_{j;f}$, especially in an over-segmentation scenario. Therefore, the initial guess for the first mixture component $\\phi_{1;f}$ is the MLE of $\\phi_{1;f}(\\theta'_1; x'_f)$, where $x'_f = \\{\\min(EMD(v_{j;f}, v^N_{j;f})) \\,|\\, v_{j;f}; j = 1, ..., z, f \\in \\{i, c, o\\}\\}$, $z$ is the total number of superpixels, and $x'_f \\subset x_f$. After obtaining $\\theta'_1 = (\\alpha'_1, \\beta'_1, c'_1)$, several candidates for $\\theta'_2$ can be computed for re-start purposes via MLE from the data satisfying $Pr(x_f | \\theta'_1) > p$, where $Pr$ is the cumulative distribution function and $p$ is a range of percentiles. Together, they form the complete initial guess parameter vector $\\theta' = (\\alpha'_1, \\beta'_1, c'_1, \\alpha'_2, \\beta'_2, c'_2, \\pi')$, where $\\pi' = 0.5$.\n\n2.3.1 Parameter estimation\n\nMaximum likelihood estimation (MLE) estimates the parameters by maximizing the log-likelihood function of the observed samples. 
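As a scaled-down illustration of this estimation step, a single two-parameter Weibull can be fit by minimizing the negative log-likelihood with Nelder-Mead. This is a sketch assuming SciPy is available; the synthetic data, starting point, and the omission of the location parameter and the mixture are simplifications for the example, not the paper's estimator:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: derivative-free MLE of a two-parameter Weibull via
# Nelder-Mead, mirroring (in miniature) how the mixture of Eq. (2)
# is fit. Data and starting point are illustrative.
def weibull_nll(params, x):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return np.inf                     # keep the search in-bounds
    z = x / alpha
    return -np.sum(np.log(beta / alpha) + (beta - 1) * np.log(z) - z ** beta)

rng = np.random.default_rng(1)
x = 0.2 * rng.weibull(1.8, size=4000)     # true scale 0.2, shape 1.8
res = minimize(weibull_nll, x0=[np.mean(x), 1.0], args=(x,),
               method="Nelder-Mead")
alpha_hat, beta_hat = res.x
print(round(alpha_hat, 2), round(beta_hat, 2))  # near 0.2 and 1.8
```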
The log-likelihood function of $W^2(x; \\theta)$ is given by:\n\n$\\ln L(\\theta; x) = \\sum_{n=1}^{N} \\ln\\{\\pi \\frac{\\beta_1}{\\alpha_1} (\\frac{x_n - c_1}{\\alpha_1})^{\\beta_1 - 1} e^{-(\\frac{x_n - c_1}{\\alpha_1})^{\\beta_1}} + (1 - \\pi) \\frac{\\beta_2}{\\alpha_2} (\\frac{x_n - c_2}{\\alpha_2})^{\\beta_2 - 1} e^{-(\\frac{x_n - c_2}{\\alpha_2})^{\\beta_2}}\\}$ (3)\n\nDue to the complexity of this log-likelihood function and the presence of the location parameters $c_{1,2}$, we adopt the Nelder-Mead method as a derivative-free optimization of MLE that performs parameter estimation by direct search [22][16], minimizing the negative log-likelihood function of Eq. 3.\nFor the NLS optimization method, $x_f$ is first approximated with a histogram, much like a box filter that smoothes a curve. The appropriate histogram bin-width for data representation is computed by $w = 2(IQR)n^{-1/3}$, where $IQR$ is the interquartile range of the data with $n$ observations [15]. This allows us to fit a two-component WMM to the height of each bin with NLS as a curve-fitting problem, which is a robust alternative to MLE when the noise level can be reduced by some approximation scheme. We then find the least-squares minimizer by using the trust-region method [27][28]. Both the Nelder-Mead MLE algorithm and the NLS method are detailed in the supplementary material.\nFigure 2 shows the WMM fit using the Nelder-Mead MLE method. In addition to the good fit of the mixture model to the data, it also shows that the right-skewed data (EMD statistics) is remarkably Weibull, further validating that EMD statistics follow a Weibull distribution both in theory and in experiments.\n\nFigure 2: (a) original image, (b) after superpixel pre-processing [1] (977 initial segments), (c) final proto-object partitioning result (150 segments). Each final segment is shown with its mean RGB value to approximate proto-object perception. 
(d) $W^2(x_f; \\theta_f)$ optimized using the Nelder-Mead algorithm for intensity, (e) color, and (f) orientation, based on the image in (b). The red lines indicate the individual Weibull components; the blue line is the density of the mixture $W^2(x_f; \\theta_f)$.\n\n2.4 Visual clutter model with model selection\n\nAt times, the dissimilar population can be highly mixed in with the similar population, and the density then resembles a single Weibull in shape, as in Figure 2d. Therefore, we fit both a single Weibull and a two-component WMM over $x_f$, and apply the Akaike Information Criterion (AIC) to prevent possible over-fitting by the two-component WMM. AIC tends to place a heavier penalty on the simpler model, which is suitable in our case to ensure that the preference is placed on the two-population mixture models. For models optimized using MLE, the standard AIC is used; for the NLS cases, the corrected AIC (AICc) for smaller sample sizes (generally when $n/k \\leq 40$) with residual sum of squares (RSS) is used, defined as $AICc = n \\ln(RSS/n) + 2k + 2k(k+1)/(n - k - 1)$, where $k$ is the number of model parameters and $n$ is the number of samples. The optimal $\\gamma_f$ can then be determined as follows:\n\n$\\gamma_f = \\begin{cases} \\max(x, \\epsilon) \\text{ s.t. } \\pi_1 \\phi_{1;f}(x | \\theta_{1;f}) = \\pi_2 \\phi_{2;f}(x | \\theta_{2;f}), & AIC(W^2) \\leq AIC(W^1) \\\\ \\max(\\alpha_1 (-\\ln(1 - \\tau))^{1/\\beta_1}, \\epsilon), & \\text{otherwise} \\end{cases}$ (4)\n\nThe first case applies when the mixture model is preferred: the optimal $\\gamma_f$ is the crossing point between the mixture components, and the equality can be solved in linear time by searching over the values of the vector $x_f$. In the second case, when the single Weibull is preferred by model selection, $\\gamma_f$ is calculated from the inverse CDF of $W^1$, which computes the location of a given percentile parameter $\\tau$. 
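The second branch of Eq. (4), together with the AICc used for the NLS fits, can be sketched as follows; all numeric values below are illustrative:

```python
import numpy as np

# Sketch of the single-Weibull fallback in Eq. (4): gamma_f is the
# inverse CDF of W1 at percentile tau, floored at the tolerance
# epsilon. The AICc for least-squares fits is included for reference.
def aicc(rss, n, k):
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def gamma_single_weibull(alpha, beta, tau, eps):
    # inverse CDF of a two-parameter Weibull at probability tau
    return max(alpha * (-np.log(1.0 - tau)) ** (1.0 / beta), eps)

g = gamma_single_weibull(alpha=0.2, beta=1.5, tau=0.8, eps=0.14)
print(round(g, 3))  # 0.275

# A nearly-blank image would drive the fitted scale toward zero,
# where the epsilon floor takes over:
print(gamma_single_weibull(alpha=0.01, beta=1.5, tau=0.8, eps=0.14))  # 0.14
```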
Note that $\\gamma_f$ is lower bounded by a tolerance parameter $\\epsilon$ in both cases to prevent unusual behavior when an image is nearly blank ($\\gamma_f \\in [\\epsilon, 1]$), making $\\tau$ and $\\epsilon$ the only model parameters in our framework.\nWe perform Principal Component Analysis (PCA) on the similarity distance values $x_f$ of intensity, color, and orientation and obtain the combined distance feature value by projecting $x_f$ onto the first principal component, such that the relative importance of each distance feature is captured by its variance through PCA. This projected distance feature is used to construct a minimum spanning tree over the superpixels to form the structure of graph $G$, which weakens the inter-cluster connectivity by removing cycles and other excessive graph connections. Finally, each edge of $G$ is labeled according to Section 2.2 given the computed $\\gamma_f$, such that an edge is labeled as 1 (similar) only if the pair of superpixels is similar in all three features. Edges labeled as 0 (dissimilar) are removed from $G$ to form isolated clusters (proto-objects), and our visual clutter model produces a normalized clutter measure between 0 and 1 by dividing the number of proto-objects by the initial number of superpixels, making it invariant to different scales of superpixel over-segmentation.\n\n3 Dataset and ground truth\n\nVarious in-house image datasets have been used in previous work to evaluate models of visual clutter. The feature congestion model was evaluated on 25 images of US city/road maps and weather maps [25]; the models in [5] and [29] were evaluated on another 25 images consisting of 6, 12, or 24 synthetically generated objects arranged into a grid; and the model from [18] used 58 images of six map or chart categories (airport terminal maps, flowcharts, road maps, subway maps, topographic charts, and weather maps). 
In each of these datasets, each image must be rank ordered for visual clutter with respect to every other image in the set by the same human subject, which is a tiring and time-consuming process. This rank ordering is essential for a clutter perception experiment, as it establishes a stable clutter metric that is meaningful across participants; alas, it limits the dataset size to the number of images each individual observer can handle. Absolute clutter scales are undesirable, as different raters might use different ranges on such a scale.\nWe created a comparatively large clutter perception dataset consisting of 90 800\u00d7600 real-world images sampled from the SUN Dataset [33], for which there exist human segmentations of objects and object counts. These object segmentations serve as one of the ground truths in our study. The high resolution of these images is also important for the accurate perception and assessment of clutter. The 90 images were selected to constitute six groups based on their ground truth object counts, with 15 images in each group. Specifically, group 1 had images with object counts in the 1-10 range, group 2 had counts in the 11-20 range, and so on up to group 6 with counts in the 51-60 range. These 90 images were rated in the laboratory by 15 college-aged participants whose task was to order the images from least to most perceived visual clutter. This was done by displaying each image one at a time and asking participants to insert it into an expanding set of previously rated images. Participants were encouraged to take as much time as they needed, and were allowed to freely scroll through the existing set of clutter-rated images when deciding where to insert the new image. A different random sequence of images was used for each participant (to control for biases and order effects), and the entire task lasted approximately one hour. 
The average correlation (Spearman\u2019s rank-order correlation) over all pairs of participants was 0.6919 ($p < 0.001$), indicating good agreement among raters. We used the median ranked position of each image as the ground truth for clutter perception in our experiments.\n\n4 Experiment and results\n\n4.1 Image feature assumptions\n\nIn their demonstration that similarity distances adhere to a Weibull distribution, Burghouts et al. [7] derived and related $L_p$-norm based distances from the statistics of sums [3][4]: for non-identical and correlated random variables $X_i$, the sum $\\sum_{i=1}^{N} X_i$ is Weibull distributed if the $X_i$ are upper-bounded with a finite $N$, where $X_i = |s_i - T_i|^p$, $N$ is the dimensionality of the feature vector, $i$ is the index, and $s, t \\in T$ are different sample vectors of the same feature.\nThe three image features used in this model are finite and upper-bounded, and we follow the procedure from [7] with the $L_2$ distance to determine whether they are correlated. We consider distances from one reference superpixel feature vector $s$ to 100 other randomly selected superpixel feature vectors $T$ (of the same feature), and compute the differences at index $i$ to obtain the random variable $X_i = |s_i - T_i|^p$. Pearson\u2019s correlation is then used to determine the relationship between $X_i$ and $X_j$, $i \\neq j$, at a confidence level of 0.05. This procedure is repeated 500 times per image for all three feature types over all 90 images. As predicted, we found an almost perfect correlation between feature value differences for each of the features tested (Intensity: 100%, Hue: 99.2%, Orientation: 98.97%). 
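This correlation check can be illustrated on synthetic data: feature vectors that share a latent factor are coordinate-wise correlated, so the squared coordinate differences $X_i$ come out correlated as well. A sketch in which the latent-factor construction is illustrative, not the paper's data:

```python
import numpy as np

# Sketch of the Burghouts-style check: build feature vectors whose
# entries share a common latent factor, form X_i = |s_i - t_i|^2
# across many sample vectors, and test whether X_i and X_j correlate.
rng = np.random.default_rng(7)
n_vectors, dim = 500, 64
shared = rng.normal(size=(n_vectors, 1))             # common factor
feats = 0.8 * shared + 0.2 * rng.normal(size=(n_vectors, dim))
s = feats[0]                                         # reference vector
X = np.abs(s - feats[1:]) ** 2                       # shape (499, 64)
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]              # corr of X_i, X_j
print(r > 0.5)  # True: coordinate-wise differences are correlated
```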
This confirms that the low-level image features used in this study follow a Weibull distribution.\n\nTable 1: Correlations between human clutter perception and all the evaluated methods. WMM is the Weibull mixture model underlying our proto-object partitioning approach, with both optimization methods.\n\nWMM-mle: 0.8038 | WMM-nls: 0.7966 | MS[9]: 0.7262 | GB[11]: 0.6612 | PL[6]: 0.6439 | ED[19]: 0.6231 | FC[25]: 0.5337 | # Obj: 0.5255 | C3[18]: 0.4810\n\n4.2 Model evaluation\n\nWe ran our model with different parameter settings of $\\epsilon \\in \\{0.01, 0.02, ..., 0.20\\}$ and $\\tau \\in \\{0.5, 0.6, ..., 0.9\\}$ using SLIC superpixels [1] initialized at 1000 seeds. We then correlated the number of proto-objects formed after superpixel merging with the ground truth behavioral clutter perception estimates by computing Spearman\u2019s rank correlation (Spearman\u2019s \u03c1), following the convention of [25][5][29][18].\nA model using MLE as the optimization method achieved the highest correlation, \u03c1 = 0.8038, p < 0.001, with $\\epsilon = 0.14$ and $\\tau = 0.8$. Because we did not have separate training/testing sets, we performed 10-fold cross-validation and obtained an average testing correlation of r = 0.7599, p < 0.001. When optimized using NLS, the model achieved a maximum correlation of \u03c1 = 0.7966, p < 0.001, with $\\epsilon = 0.14$ and $\\tau = 0.4$, and the corresponding 10-fold cross-validation yielded an average testing correlation of r = 0.7375, p < 0.001. The high cross-validation averages indicate that our model is robust and generalizes to unseen data.\nIt is worth pointing out that the optimal value of the tolerance parameter $\\epsilon$ showed a peak correlation at 0.14. 
To the extent that this is meaningful and extends to people, it suggests that visual clutter perception may ignore feature dissimilarity on the order of 14% when deciding whether two adjacent regions are similar and should be merged.\nWe compared our model to four other state-of-the-art models of clutter perception: the feature congestion model [25], the edge density method [19], the power-law model [6], and the C3 model [18]. Table 1 shows that our model significantly outperformed all of these previously reported methods. The relatively poor performance of the recent C3 model was surprising, and can probably be attributed to that model having previously been evaluated on charts and maps rather than arbitrary realistic scenes (personal communication with the authors). Collectively, these results suggest that a model that merges superpixels into proto-objects best describes human clutter perception, and that the benefit of using a proto-object model for clutter prediction is not small; our model yielded an improvement of at least 15% over existing models of clutter perception. Although we did not record run-time statistics for the other models, our model, implemented in Matlab1, had an end-to-end (excluding superpixel pre-processing) run-time of 15-20 seconds on 800\u00d7600 images, running on a Win7 Intel Core i-7 computer with 8 GB RAM.\n\n4.3 Comparison to image segmentation methods\n\nWe also attempted to compare our method to state-of-the-art image segmentation algorithms such as gPb-ucm [2], but found that the method was unable to process our image dataset using either an Intel Core i-7 machine with 8 GB RAM or an Intel Xeon machine with 16 GB RAM, at the high image resolutions required by our behavioral clutter estimates. 
A similar limitation was found for image segmentation methods that utilize gPb contour detection as pre-processing, such as [8][14], while [23][34] took 10 hours on a single image and did not converge.\nTherefore, we limit our evaluation to mean-shift [9] and the graph-based method [11], as they are able to produce variable numbers of segments based on the unsupervised partitioning of the 90 images from our dataset. Despite using the best dataset parameter settings for these unsupervised methods, our method remains the model most highly correlated with the clutter perception ground truth, as shown in Table 1, and the methods allowing quantification of proto-object set size (WMM, mean-shift, and graph-based) outperformed all of the previous clutter models.\nWe also correlated the number of objects segmented by humans (as provided in the SUN Dataset) with the clutter perception ground truth, denoted as # Obj in Table 1. Interestingly, despite object count being a human-derived estimate, it produced among the lowest correlations with clutter perception. This suggests that clutter perception is not determined simply by the number of objects in a scene; it is the proto-object composition of these objects that is important.\n\n1Code is available at mysbfiles.stonybrook.edu/~cheyu/projects/proto-objects.html\n\nFigure 3: Top: Four images from our dataset, rank ordered for clutter perception by human raters; median clutter rank order from left to right: 6, 47, 70, 87. Bottom: Corresponding images after parametric proto-object partitioning; median clutter rank order from left to right: 7, 40, 81, 83.\n\n5 Conclusion\n\nWe proposed a model of visual clutter perception based on a parametric image partitioning method that is fast and able to work on large images. 
This method of segmenting proto-objects from an image using a mixture of Weibull distributions is also novel in that it models similarity distance statistics rather than feature statistics obtained directly from pixels. Our work also contributes to the behavioral understanding of clutter perception. We showed that our model is an excellent predictor of human clutter perception, outperforming all existing clutter models, and that it predicts clutter perception better than even a behavioral segmentation of objects. This suggests that clutter perception is best described at the proto-object level, a level intermediate between objects and features. Moreover, our work suggests a means of objectively quantifying a behaviorally meaningful set size for scenes, at least with respect to clutter perception. We also introduced a new and validated clutter perception dataset consisting of a variety of scene types and object categories. This dataset, the largest and most comprehensive to date, will likely be used widely in future model evaluation and method comparison studies. In future work we plan to extend our parametric partitioning method to general image segmentation and data clustering problems, and to use our model to predict human visual search behavior and other behaviors that might be affected by visual clutter.\n\n6 Acknowledgment\n\nWe thank the authors of [18] for sharing and discussing their code, Dr. Burghouts for providing detailed explanations of the feature assumptions in [7], and Dr. Matthew Asher for providing the human search performance data from their work in Journal of Vision, 2013. This work was supported by NIH Grant R01-MH063748 to G.J.Z., NSF Grant IIS-1111047 to G.J.Z. and D.S., and the SUBSAMPLE Project of the DIGITEO Institute, France.\n\nReferences\n\n[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI, 2012.\n[2] P. 
Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 2010.

[3] E. Bertin. Global fluctuations and Gumbel statistics. Physical Review Letters, 2005.

[4] E. Bertin and M. Clusel. Generalised extreme value statistics and sum of correlated variables. Journal of Physics A, 2006.

[5] M. J. Bravo and H. Farid. Search for a category target in clutter. Perception, 2004.

[6] M. J. Bravo and H. Farid. A scale invariant measure of clutter. Journal of Vision, 2008.

[7] G. J. Burghouts, A. W. M. Smeulders, and J.-M. Geusebroek. The distribution family of similarity distances. In NIPS, 2007.

[8] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.

[9] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE TPAMI, 2002.

[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[11] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.

[12] J.-M. Geusebroek and A. W. Smeulders. A six-stimulus theory for stochastic texture. IJCV, 2005.

[13] J. M. Henderson, M. Chanceaux, and T. J. Smith. The influence of clutter on real-world scene search: Evidence from search efficiency and eye movements. Journal of Vision, 2009.

[14] A. Ion, J. Carreira, and C. Sminchisescu. Image segmentation by figure-ground composition into maximal cliques. In ICCV, 2011.

[15] A. J. Izenman. Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 1991.

[16] J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright. Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM Journal on Optimization, 1998.

[17] E. Levina and P. Bickel.
The earth mover's distance is the Mallows distance: some insights from statistics. In ICCV, 2001.

[18] M. C. Lohrenz, J. G. Trafton, R. M. Beck, and M. L. Gendron. A model of clutter for complex, multivariate geospatial displays. Human Factors, 2009.

[19] M. L. Mack and A. Oliva. Computational estimation of visual complexity. In the 12th Annual Object, Perception, Attention, and Memory Conference, 2004.

[20] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.

[21] M. B. Neider and G. J. Zelinsky. Cutting through the clutter: searching for targets in evolving complex scenes. Journal of Vision, 2011.

[22] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 1965.

[23] S. R. Rao, H. Mobahi, A. Y. Yang, S. Sastry, and Y. Ma. Natural image segmentation with adaptive texture and boundary encoding. In ACCV, 2009.

[24] R. A. Rensink. Seeing, sensing, and scrutinizing. Vision Research, 2000.

[25] R. Rosenholtz, Y. Li, and L. Nakano. Measuring visual clutter. Journal of Vision, 2007.

[26] Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, 1998.

[27] T. Steihaug. The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 1983.

[28] P. L. Toint. Towards an efficient sparsity exploiting Newton method for minimization. Sparse Matrices and Their Uses, 1981.

[29] R. van den Berg, F. W. Cornelissen, and J. B. T. M. Roerdink. A crowding model of visual clutter. Journal of Vision, 2009.

[30] O. Veksler, Y. Boykov, and P. Mehrani. Superpixels and supervoxels in an energy optimization framework. In ECCV, 2010.

[31] M. Wischnewski, A. Belardinelli, W. X. Schneider, and J. J. Steil. Where to look next?
Combining static and dynamic proto-objects in a TVA-based model of visual attention. Cognitive Computation, 2010.

[32] J. M. Wolfe. Visual search. Attention, 1998.

[33] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

[34] A. Y. Yang, J. Wright, Y. Ma, and S. Sastry. Unsupervised segmentation of natural images via lossy data compression. CVIU, 2008.

[35] V. Yanulevskaya and J.-M. Geusebroek. Significance of the Weibull distribution and its sub-models in natural image statistics. In Int. Conference on Computer Vision Theory and Applications, 2009.