{"title": "Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 565, "abstract": "Many methods have been proposed to recover the intrinsic scene properties such as shape, reflectance and illumination from a single image. However, most of these models have been applied on laboratory datasets. In this work we explore the synergy effects between intrinsic scene properties recovered from an image, and the objects and attributes present in the scene. We cast the problem in a joint energy minimization framework; thus our model is able to encode the strong correlations between intrinsic properties (reflectance, shape, illumination), objects (table, tv-monitor), and materials (wooden, plastic) in a given scene. We tested our approach on the NYU and Pascal datasets, and observe both qualitative and quantitative improvements in the overall accuracy.", "full_text": "Higher Order Priors for Joint Intrinsic Image,\n\nObjects, and Attributes Estimation\n\nVibhav Vineet\n\nOxford Brookes University, UK\n\nvibhav.vineet@gmail.com\n\nCarsten Rother\n\nTU Dresden, Germany\n\ncarsten.rother@tu-dresden.de\n\nPhilip H.S. Torr\n\nUniversity of Oxford, UK\nphilip.torr@eng.ox.ac.uk\n\nAbstract\n\nMany methods have been proposed to solve the problems of recovering intrinsic\nscene properties such as shape, re\ufb02ectance and illumination from a single image,\nand object class segmentation separately. While these two problems are mutually\ninformative, in the past not many papers have addressed this topic. 
In this work we\nexplore such joint estimation of intrinsic scene properties recovered from an im-\nage, together with the estimation of the objects and attributes present in the scene.\nIn this way, our uni\ufb01ed framework is able to capture the correlations between\nintrinsic properties (re\ufb02ectance, shape, illumination), objects (table, tv-monitor),\nand materials (wooden, plastic) in a given scene. For example, our model is able to\nenforce the condition that if a set of pixels take same object label, e.g. table, most\nlikely those pixels would receive similar re\ufb02ectance values. We cast the problem\nin an energy minimization framework and demonstrate the qualitative and quanti-\ntative improvement in the overall accuracy on the NYU and Pascal datasets.\n\nIntroduction\n\n1\nRecovering scene properties (shape, illumination, re\ufb02ectance) that led to the generation of an image\nhas been one of the fundamental problems in computer vision. Barrow and Tenebaum [13] posed\nthis problem as representing each scene properties with its distinct \u201cintrinsic\u201d images. Over the\nyears, many decomposition methods have been proposed [5, 16, 17], but most of them focussed on\nrecovering a re\ufb02ectance image and a shading1 image without explicitly modelling illumination or\nshape. But in the recent years a breakthrough in the research on intrinsic images came with the works\nof Barron and Malik [1-4] who presented an algorithm that jointly estimated the re\ufb02ectance, the\nillumination and the shape. They formulate this decomposition problem as an energy minimization\nproblem that captures prior information about the structure of the world.\nFurther, recognition of objects and their material attributes is central to our understanding of the\nworld. A great deal of work has been devoted to estimating the objects and their attributes in the\nscene: Shotton et.al. [22] and Ladicky et.al. 
[9] propose approaches to estimate the object labels at\nthe pixel level. Separately, Adelson [20], Farhadi et.al. [6], Lazebnik et.al. [23] de\ufb01ne and estimate\nthe attributes at the pixel, object and scene levels. Some of these attributes are material properties\nsuch as woollen, metallic, shiny, and some are structural properties such as rectangular, spherical.\nWhile these methods for estimating the intrinsic images, objects and attributes have separately been\nsuccessful in generating good results on laboratory and real-world datasets, they fail to capture the\nstrong correlation existing between these properties. Knowledge about the objects and attributes\nin the image can provide strong prior information about the intrinsic properties. For example, if a\nset of pixels takes the same object label, e.g. table, most likely those pixels would receive similar\nre\ufb02ectance values. Thus recovering the objects and their attributes can help reduce the ambiguities\npresent in the world leading to better estimation of the re\ufb02ectance and other intrinsic properties.\n\n1shading is the product of some shape and some illumination model which includes effects such as shadows,\n\nindirect lighting etc.\n\n1\n\n\fInput Image\n\nInput Depth Image\n\nRe\ufb02ectance\n\nShading\n\nDepth\n\nObject\n\nAttributes\n\nObject-color coding\n\nAttribute-color coding\nFigure 1: Given a RGBD image, our algorithm jointly estimates the intrinsic properties such as\nre\ufb02ectance, shading and depth maps, along with the per-pixel object and attribute labels.\n\nAdditionally such a decomposition might be useful for per-pixel object and attribute segmentation\ntasks. For example, using re\ufb02ectance (illumination invariant) should improve the results-when esti-\nmating per-pixel object and attribute labels [24]. 
Moreover if a set of pixels have similar re\ufb02ectance\nvalues, they are more likely to have the same object and attribute class.\nSome of the previous research has looked at the correlation of objects and intrinsic properties by\npropagating results from one step to the next. Osadchy et.al. [18] use specular highlights to improve\nrecognition of transparent, shiny objects. Liu et.al. [15] recognize material categories utilizing the\ncorrelation between the materials and their re\ufb02ectance properties (e.g. glass is often translucent).\nWeijer et.al. [14] use knowledge of the objects present in the scene to better separate the illumination\nfrom the re\ufb02ectance images. However, the problem with these approaches is that the errors in one\nstep can propagate to the next steps with no possibility of recovery. Joint estimation of the intrinsic\nimages, objects and attributes can be used to overcome these issues. For instance, in the context of\njoint object recognition and depth estimation such positive synergy effects have been shown in e.g.\n[8].\nIn this work, our main contribution is to explore such synergy effects existing between the intrinsic\nproperties, objects and material attributes present in a scene (see Fig. 1). Given an image, our\nalgorithm jointly estimates the intrinsic properties such as re\ufb02ectance, shading and depth maps,\nalong with per-pixel object and attribute labels. We formulate it in a global energy minimization\nframework, and thus our model is able to enforce the consistency among these terms. Finally,\nwe use an approximate dual decomposition based strategy to ef\ufb01ciently perform inference in the\njoint model consisting of both the continuous (re\ufb02ectance, shape and illumination) and discrete\n(objects and attributes) variables. We demonstrate the potential of our approach on the aNYU and\naPascal datasets, which are extended versions of the NYU [25] and Pascal [26] datasets with per-\npixel attribute labels. 
We evaluate both the qualitative and quantitative improvements for the object\nand attribute labelling, and qualitative improvement for the intrinsic images estimation.\nWe introduce the problem in Sec. 2. Section 3 provides details about our joint model, section 4\ndescribes our inference and learning, Sec. 5 and 6 provide experimentation and discussion.\n2 Problem Formulation\nOur goal is to jointly estimate the intrinsic properties of the image, i.e.\nre\ufb02ectance, shape and\nillumination, along with estimating the objects and attributes at the pixel level, given an image\narray \u00afC = ( \u00afC1... \u00afCV ) where \u00afCi \u2208 R3 is the ith pixel\u2019s associated RGB value in the image with\ni \u2208 V = {1...V }. Before going into the details of the joint formulation, we consider the formulations\nfor independently solving these problems. We \ufb01rst brie\ufb02y describe the SIRFS (shape, illumination\nand re\ufb02ectance from shading) model [2] for estimating the intrinsic properties for a single given\nobject, and then a CRF model for estimating objects, and attributes [12].\n2.1 SIRFS model for a single, given object mask\nWe build on the SIRFS model [2] for estimating the intrinsic properties of an image. They formu-\nlate the problem of recovering the shape, illumination and re\ufb02ectance as an energy minimization\nproblem given an image. Let R = (R1...RV ), Z = (Z1...ZV ) be the re\ufb02ectance, and depth maps\nrespectively, where Ri \u2208 R3 and Zi \u2208 R3, and the illumination L be a 27-dimensional vector of\nspherical harmonics [10]. Further, let S(Z, L) be a function that generates a shading image given\nthe depth map Z and the illumination L. Here Si \u2208 R3 and subsumes all light-dependent properties,\ne.g. shadows, inter-re\ufb02ections (refer to [2] for details). 
The SIRFS model then minimizes the energy\n\nminimize over R, Z, L:  ESIRFS = ER(R) + EZ(Z) + EL(L)\nsubject to:  \u00afC = R \u00b7 S(Z, L)    (1)\n\nwhere \u201c\u00b7\u201d represents componentwise multiplication, and ER(R), EZ(Z) and EL(L) are the costs\nfor the reflectance, depth and illumination respectively. The most likely solution is then estimated\nusing a multi-scale L-BFGS strategy [2] (L-BFGS is a limited-memory approximation of the\nBroyden-Fletcher-Goldfarb-Shanno algorithm), which in practice finds better local optima than other\ngradient descent strategies. The SIRFS model is limited to estimating the intrinsic properties for a single object mask\nwithin an image. The recently proposed Scene-SIRFS model [4] recovers\nthe intrinsic properties of the whole image by embedding a mixture of shapes in a soft segmentation\nof the scene. In Sec. 3 we will also extend the SIRFS model to handle multiple objects. The main\ndifference to Scene-SIRFS is that we perform joint optimization over the object (and attribute)\nlabelling and the intrinsic image properties per pixel.\n2.2 Multilabel Object and Attribute Model\nThe problem of estimating the per-pixel object and attribute labels can also be formulated in a\nCRF framework [12]. Let O = (O1...OV ) and A = (A1...AV ) be the object and attribute variables\nassociated with all V pixels, where each object variable Oi takes one out of K discrete labels such as\ntable, monitor, or floor. Each attribute variable Ai takes a label from the power set of the M attribute\nlabels; for example, the subset of attribute labels can be Ai = {red, shiny, wet}. 
Efficient inference is performed by first representing each attribute subset Ai by M binary attribute\nvariables A^m_i \u2208 {0, 1}, meaning that A^m_i = 1 if the ith pixel takes the mth attribute and the attribute\nis absent when A^m_i = 0.\nUnder this assumption, the most likely solution for the objects and the attributes corresponds to\nminimizing the following energy function\n\nEOA(O, A) = \u2211_{i\u2208V} \u03c8i(Oi) + \u2211_m \u2211_{i\u2208V} \u03c8i,m(A^m_i) + \u2211_{i<j\u2208V} \u03c8ij(Oi, Oj) + \u2211_m \u2211_{i<j\u2208V} \u03c8ij(A^m_i, A^m_j)    (2)\n\nHere \u03c8i(Oi) and \u03c8i,m(A^m_i) are the object and per-binary-attribute dependent unary terms respectively.\nSimilarly, \u03c8ij(Oi, Oj) and \u03c8ij(A^m_i, A^m_j) are the pairwise terms defined over the object and\nper-binary-attribute variables. Finally, the best configuration for the objects and attributes is\nestimated using a mean-field based inference approach [12]. Further details about the form of the unary and\npairwise terms and the inference approach are described in our technical report [29].\n3 Joint Model for Intrinsic Images, Objects and Attributes\nNow, we provide the details of our formulation for jointly estimating the intrinsic images (R, Z, L)\nalong with the object (O) and attribute (A) properties given an image \u00afC in a probabilistic\nframework. We define the posterior probability and the corresponding joint energy function E as:\n\nP (R, Z, L, O, A|I) = 1/Z(I) exp{\u2212E(R, Z, L, O, A|I)}\nE(R, Z, L, O, A|I) = ESIRFSG(R, Z, L|O, A) + ERO(R, O) + ERA(R, A) + EOA(O, A)    (3)\n\nWe define ESIRFSG = ER(R) + EZ(Z) + EL(L), a new global energy term. The terms\nERO(R, O) and ERA(R, A) capture correlations between the reflectance, objects and/or attribute\nlabels assigned to the pixels. 
These terms take the form of higher order potentials defined on\nthe image segments or regions of pixels generated using the unsupervised segmentation approach of\nFelzenszwalb and Huttenlocher [21]. Let S correspond to the set of these image segments. These\nterms are described in detail below.\n3.1 SIRFS model for a scene\nGiven this representation of the scene, we model the scene-specific ESIRFSG by a mixture of\nreflectance and depth terms embedded into the segmentation of the image, and an illumination term,\nas:\n\nESIRFSG(R, Z, L|O, A) = \u2211_{c\u2208S} ( ER(Rc) + EZ(Zc) ) + EL(L)\nsubject to:  \u00afC = R \u00b7 S(Z, L)    (4)\n\nwhere R = {Rc}, Z = {Zc}. Here ER(Rc) and EZ(Zc) are the reflectance and depth terms\nrespectively defined over segments c \u2208 S. In the current formulation, we have assumed\na single model of illumination L for the whole scene, which corresponds to a 27-dimensional vector of\nspherical harmonics [2].\n3.2 Reflectance, Objects term\nThe joint reflectance-object energy term ERO(R, O) captures the relations between the objects\npresent in the scene and their reflectance properties. Our higher order term takes the following form:\n\nERO(R, O) = \u2211_{c\u2208S} \u03c0^c_o \u03c8(Rc) + \u2211_{c\u2208S} \u03c0^c_r \u03c8(Oc)    (5)\n\nwhere Rc, Oc are the reflectance and object labelings for the subset of pixels c respectively. Here \u03c0^c_o \u03c8(Rc) is an object\ndependent quality sensitive higher order cost defined over the reflectance variables, and \u03c0^c_r \u03c8(Oc) is\na reflectance dependent quality sensitive higher order cost defined over the object variables. 
The term\n\u03c8(Rc) reduces the variance of the reflectance values within a clique and takes the form\n\n\u03c8(Rc) = \u2016c\u2016^{\u03b8\u03b1} (\u03b8p + \u03b8v Gr(c)),  where  Gr(c) = exp( \u2212\u03b8\u03b2 \u2016\u2211_{i\u2208c} (Ri \u2212 \u00b5c)^2\u2016 / \u2016c\u2016 )    (6)\n\nHere \u2016c\u2016 is the size of the clique, \u00b5c = (\u2211_{i\u2208c} Ri) / \u2016c\u2016, and \u03b8\u03b1, \u03b8p, \u03b8v, \u03b8\u03b2 are constants. Further, in order\nto measure the quality of the reflectance assignment to the segment, we weight the higher order cost\n\u03c8(Rc) with an object dependent \u03c0^c_o that measures the quality of the segment. In our case, \u03c0^c_o takes\nthe following form:\n\n\u03c0^c_o = 1 if Oi = l, \u2200i \u2208 c;  \u03bbo otherwise    (7)\n\nwhere \u03bbo < 1 is a constant. This term allows variables within a segment to take different reflectance\nvalues if the pixels in that segment take different object labels. Currently the term \u03c0^c_o gives rise to a\nhard constraint on the penalty but can be extended to one that penalizes the cost softly as in [29].\nSimilarly we enforce higher order consistency over the object labeling in a clique c \u2208 S. The term\n\u03c8(Oc) takes the form of the pattern-based P^N-Potts model [7]:\n\n\u03c8(Oc) = \u03b3^o_l if Oi = l, \u2200i \u2208 c;  \u03b3^o_max otherwise    (8)\n\nwhere \u03b3^o_l, \u03b3^o_max are constants. Further we weight this term with a reflectance dependent quality\nsensitive term \u03c0^c_r. In our experiments we measure this term based on the variance of the reflectance\nterms on all constituent pixels of a segment, i.e., Gr(c) (defined earlier). Thus \u03c0^c_r takes the following\nform:\n\n\u03c0^c_r = 1 if Gr(c) < K, \u2200i \u2208 c;  \u03bbr otherwise    (9)\n\nwhere K and \u03bbr < 1 are constants. 
Essentially, this quality measurement allows the pixels within\na segment to take different object labels if the variation in the reflectance terms within the segment\nis above a threshold. To summarize, these two higher order terms impose a cost on inconsistency\nwithin the object and reflectance labels.\n3.3 Reflectance, Attributes term\nSimilarly we define the term ERA(R, A), which enforces a higher order consistency between\nreflectance and attribute variables. Such higher order consistency takes the following form:\n\nERA(R, A) = \u2211_m ( \u2211_{c\u2208S} \u03c0^c_{a,m} \u03c8(Rc) + \u2211_{c\u2208S} \u03c0^c_r \u03c8(A^m_c) )    (10)\n\nwhere \u03c0^c_{a,m} \u03c8(Rc) and \u03c0^c_r \u03c8(A^m_c) are the higher order terms defined over the reflectance image and\nthe attribute image corresponding to the mth attribute respectively. The forms of these terms are similar\nto the ones defined for the object-reflectance higher order terms; these terms are further explained in\nthe supplementary material.\n4 Inference and Learning\nGiven the above model, our optimization problem involves solving the following joint energy function\nto get the most likely solution for (R, Z, L, O, A):\n\nE(R, Z, L, O, A|I) = ESIRFSG(R, Z, L) + ERO(R, O) + ERA(R, A) + EOA(O, A)    (11)\n\nHowever, this problem is very challenging since it consists of both the continuous variables\n(R, Z, L) and discrete variables (O, A). Thus in order to minimize the function efficiently\nwithout losing accuracy we follow an approximate dual decomposition strategy [28].\nWe first introduce a set of duplicate variables for the reflectance (R1, R2, R3), objects (O1, O2),\nand attributes (A1, A2) and a set of new equality constraints to enforce the consistency on these\nduplicate variables. 
Our optimization problem thus takes the following form:\n\nminimize over R1, R2, R3, Z, L, O1, O2:  E(R1, Z, L) + E(O1, A1) + E(R2, O2) + E(R3, A2)\nsubject to:  R1 = R2 = R3; O1 = O2; A1 = A2    (12)\n\nFor notational simplicity, we drop the subscripts and superscripts from the energy terms from here on.\nNow we formulate it as an unconstrained optimization problem by introducing a\nset of Lagrange multipliers \u03b8^1_r, \u03b8^2_r, \u03b8o, \u03b8a and decompose the dual problem into four sub-problems as:\n\nE(R1, Z, L) + E(O1, A1) + E(R2, O2) + E(R3, A2) + \u03b8^1_r (R1 \u2212 R2)\n+ \u03b8^2_r (R2 \u2212 R3) + \u03b8o(O1 \u2212 O2) + \u03b8a(A1 \u2212 A2)\n= g1(R1, Z, L) + g2(O1, A1) + g3(O2, R2) + g4(A2, R3),    (13)\n\nwhere\n\ng1(R1, Z, L) = minimize over R1, Z, L:  E(R1, Z, L) + \u03b8^1_r R1\ng2(O1, A1) = minimize over O1, A1:  E(O1, A1) + \u03b8o O1 + \u03b8a A1\ng3(O2, R2) = minimize over O2, R2:  E(O2, R2) \u2212 \u03b8o O2 \u2212 \u03b8^1_r R2\ng4(A2, R3) = minimize over A2, R3:  E(A2, R3) \u2212 \u03b8a A2 \u2212 \u03b8^2_r R3    (14)\n\nare the slave problems, which are optimized separately and efficiently while treating the dual\nvariables \u03b8^1_r, \u03b8^2_r, \u03b8o, \u03b8a as constant, and the master problem then optimizes these dual variables to enforce\nconsistency. Next, we solve each of the sub-problems and the master problem.\nSolving subproblem g1(R1, Z, L): Solving the sub-problem g1(R1, Z, L) requires optimizing\nwith only continuous variables (R1, Z, L). We follow a multi-scale L-BFGS strategy [2] to\noptimize this part. Each step of the L-BFGS approach requires evaluating the gradient of g1(R1, Z, L)\nwith respect to R1, Z, L.\nSolving subproblem g2(O1, A1): The second sub-problem g2(O1, A1) involves only discrete\nvariables (O1, A1). The dual variable dependent terms add \u03b8o O1 to the object unary potential\n\u03c8i(O1) and \u03b8a A1 to the attribute unary potential \u03c8i(A1). 
Let \u03c8\u2032(O1) and \u03c8\u2032(A1) be the updated\nobject and attribute unary potentials. We follow a filter-based mean-field strategy [11, 12] for the\noptimization. In the mean-field framework, given the true distribution P = exp(\u2212g2(O1, A1)) / \u00afZ, we find\nan approximate distribution Q, where the approximation is measured in terms of the KL-divergence\nbetween the P and Q distributions. Here \u00afZ is the normalizing constant. Based on the model in\nSec. 2.2, Q takes the form Qi(O^1_i, A^1_i) = Q^O_i(O^1_i) \u220f_m Q^A_{i,m}(A^{1,m}_i), where Q^O_i is a multi-class\ndistribution over the object variable, and Q^A_{i,m} is a binary distribution over {0,1}. With this, the\nmean-field updates for the object variables take the following form:\n\nQ^O_i(O^1_i = l) = (1 / Z^O_i) exp{ \u2212\u03c8\u2032i(O^1_i) \u2212 \u2211_{l\u2032\u22081..K} \u2211_{j\u2260i} Q^O_j(O^1_j = l\u2032) \u03c8ij(O^1_i, O^1_j) }    (15)\n\nwhere \u03c8ij is a Potts term modulated by a contrast sensitive pairwise cost defined by a mixture of\nGaussian kernels [12], and Z^O_i is the per-pixel normalization factor. Given this form of the pairwise\nterms, as in [12], we can efficiently evaluate the pairwise summations in Eq. 15 using K Gaussian\nconvolutions. 
The updates for the attribute variables also take a similar form (refer to the\nsupplementary material).\nSolving subproblems g3(O2, R2), g4(A2, R3): These two problems take the following forms:\n\ng3(O2, R2) = minimize over O2, R2:  \u2211_{c\u2208S} \u03c0^c_{o2} \u03c8(R^2_c) + \u2211_{c\u2208S} \u03c0^c_{r2} \u03c8(O^2_c) \u2212 \u03b8o O2 \u2212 \u03b8^1_r R2\ng4(A2, R3) = minimize over A2, R3:  \u2211_m ( \u2211_{c\u2208S} \u03c0^c_{a2,m} \u03c8(R^3_c) + \u2211_{c\u2208S} \u03c0^c_{r3} \u03c8(A^{2,m}_c) ) \u2212 \u03b8a A2 \u2212 \u03b8^2_r R3    (16)\n\nSolving these two sub-problems requires optimizing over both the continuous (R2, R3) and discrete\n(O2, A2) variables. However, since these two sub-problems consist of higher order terms\n(described in Eq. 8) and dual variable dependent terms, we follow a simple co-ordinate descent\nstrategy to update the reflectance and the object (and attribute) variables iteratively. The optimization\nof the object (and attribute) variables is performed in a mean-field framework, and a gradient\ndescent based approach is used for the reflectance variables.\nSolving master problem: The master problem then updates the dual variables \u03b8^1_r, \u03b8^2_r, \u03b8o, \u03b8a given\nthe current solution from the slaves. Here we provide the update equations for \u03b8^1_r; the updates\nfor the other dual variables take a similar form. The master calculates the gradient of the problem\nE(R, Z, L, O, A|I) with respect to \u03b8^1_r, and then iteratively updates the values of \u03b8^1_r as:\n\n\u03b8^1_r = \u03b8^1_r + \u03b1^t_r ( g^{\u03b8^1_r}_1(R1, Z, L) + g^{\u03b8^1_r}_3(O2, R2) )    (17)\n\nwhere \u03b1^t_r is the step size at the tth iteration and g^{\u03b8^1_r}_1, g^{\u03b8^1_r}_3 are the gradients with respect to \u03b8^1_r. It should be noted\nthat we do not guarantee the convergence of our approach since the subproblems g1(.) 
and g2(.) are\nsolved approximately. Further details on our inference techniques are provided in the supplementary\nmaterial.\nLearning:\nIn the model described above, there are many parameters joining each of these terms.\nWe use a cross-validation strategy to estimate these parameters in a sequential manner and thus\nensuring ef\ufb01cient strategy to estimate a good set of parameters. The unary potentials for the objects\nand attributes are learnt using a modi\ufb01ed TextonBoost model of Ladicky et.al. [9] which uses a\ncolour, histogram of oriented gradient (HOG), and location features.\n5 Experiments\nWe demonstrate our joint estimation approach on both the per-pixel object and attribute labelling\ntasks, and estimation of the intrinsic properties of the images. For the object and attribute labelling\ntasks, we conduct experiments on the NYU 2 [25] and Pascal [26] datasets both quantitatively and\nqualitatively. To this end, we annotate the NYU 2 and the Pascal datasets with per-pixel attribute\nlabels. As a baseline, we compare our joint estimation approach against the mean-\ufb01eld based method\n[12], and the graph-cuts based \u03b1-expansion method [9]. We assess the accuracy in terms of the\noverall percentage of the pixels correctly labelled, and the intersection/union score per class (de\ufb01ned\nin terms of the true/false positives/negatives for a given class as TP/(TP+FP+FN)). Additionally we\nalso evaluate our approach in estimating better intrinsic properties of the images though qualitatively\nonly, since it is extremely dif\ufb01cult to generate the ground truths for the intrinsic properties, e.g.\nre\ufb02ectance, depth and illumination for any general image. We compare our intrinsic properties\nresults against the model of Barron and Malik2[2, 4], Gehler et.al. 
[5] and the Retinex model [17].\nFurther, we show, visually only, how our approach recovers smoother and de-noised\ndepth maps compared to the raw depth provided by the Kinect [25]. In all these cases, we use the\ncode provided by the authors for the AHCRF [9] and mean-field [11, 12] approaches. Details of all the\nexperiments are provided below.\n5.1 aNYU 2 dataset\nWe first conduct experiments on the aNYU 2 RGBD dataset, an extended version of the indoor NYU\n2 dataset [25]. The dataset consists of 725 training images, 100 validation and 624 test images.\nFurther, the dataset consists of per-pixel object and attribute labels (see Fig. 1 and 3 for per-pixel\nattribute labels). We select 15 object and 8 attribute classes that have a sufficient number of instances\nto train the unary classifier responses. The object labels correspond to indoor object classes\nsuch as floor, wall, etc., and the attribute labels correspond to material properties of the objects such as wooden,\npainted, etc. Further, since this dataset provides Kinect depth maps, we use them to initialize\nthe depth maps Z for both our joint estimation approach and the Barron and Malik models [2-4].\n\n2 We extended the SIRFS [2] model to our Scene-SIRFS using a mixture of reflectance and depth maps,\nand a single illumination model. These mixtures of reflectance and depth maps were embedded in the soft\nsegmentation of the scene generated using the approach of Felzenszwalb et al. [21]. We call this model: Barron\nand Malik [2,4].\n\n(a) Object Accuracy\nAlgorithm | Av. I/U | Overall (% corr)\nAHCRF [9] | 28.88 | 51.06\nDenseCRF [12] | 29.66 | 50.70\nOurs (OA+Intr) | 30.14 | 52.23\n\n(b) Attribute Accuracy\nAlgorithm | Av. I/U | Overall (% corr)\nAHCRF [9] | 21.9 | 40.7\nDenseCRF [12] | 22.02 | 37.6\nOurs (OA+Intr) | 24.175 | 39.25\n\nTable 1: Quantitative results on aNYU 2 dataset for both the object segmentation (a), and attributes\nsegmentation (b) tasks. The table compares performance of our approach (last line) against two\nbaselines. The importance of our joint estimation for intrinsic images, objects and attributes is\nconfirmed by the better performance of our algorithm compared to the graph-cuts based (AHCRF)\nmethod [9] and mean-field based approach [12] for both the tasks. Here intersection vs. union (I/U)\nis defined as TP/(TP+FN+FP) and \u2019% corr\u2019 as the total proportion of correctly labelled pixels.\n\n[Figure 2 panels, two example scenes: input image, Kinect depth; our reflectance, shading, normals and depth; reflectance, shading, normals and depth from [2,4]; reflectance and shading from [17] and [5].]\n\nFigure 2: Given an image and its depth image for the aNYU dataset, these figures qualitatively\ncompare our algorithm in jointly estimating better the intrinsic properties such as reflectance, shading,\nnormals and depth maps. We compare against the model of Barron and Malik [2,4], the Retinex model\n[17] (2nd last column) and the Gehler et al. approach [5] (last column).\n\nWe show quantitative and qualitative results in Tab. 1 and Fig. 3 respectively. As shown, our joint\napproach achieves an improvement of almost 2.3% and 1.2% in the overall accuracy and average\nintersection-union (I/U) score over the model of AHCRF [9], and almost 1.5% improvement in the\naverage I/U over the model of [12] for the object class segmentation. 
Similarly we also observe an\nimprovement of almost 2.2% and 0.5% in the overall accuracy and I/U score over AHCRF [9],\nand almost 2.1% and 1.6% in the overall accuracy and average I/U over the model of [12] for the\nper-pixel attribute labelling task. These quantitative improvements suggest that our model is able to\nimprove the object and attribute labelling using the intrinsic properties information. Qualitatively\nwe also observe an improvement in the output of both the object and attribute segmentation tasks, as\nshown in Fig. 3.\nFurther, we show the qualitative improvement in the results of the intrinsic properties in Fig. 2.\nAs shown, our joint approach helps to recover a better depth map compared to the noisy Kinect depth\nmaps, justifying the unification of reconstruction with object and attribute based recognition tasks.\nFurther, our reflectance and shading images visually look much better than those of the Retinex\n[17] and Gehler et al. [5] models, and similar to those of the Barron and Malik approach [2,4].\n5.2 aPascal dataset\nWe also show experiments on the aPascal dataset, our extended Pascal dataset with per-pixel attribute\nlabels. We select a subset of 517 images with the per-pixel object labels from the Pascal dataset and\nannotate it with 7 material attribute labels at the pixel level. These attributes include wooden,\nskin, metallic, glass, shiny, etc. Further, for the Pascal dataset we do not have any initial depth\nestimate. Thus, we start with a depth map where each point in the space is given the same constant\ndepth value.\nSome quantitative and qualitative results are shown in Tab. 2 and Fig. 3 respectively. As shown, our\napproach achieves an improvement of almost 2.0% and 0.5% in the I/U score for the object and\nattribute labelling tasks respectively over the model of [12]. We observe qualitative improvement in\nthe accuracy shown in Fig. 3.\n\n(a) Object Accuracy\nAlgorithm | Av. I/U | Overall (% corr)\nAHCRF [9] | 32.53 | 82.30\nDenseCRF [12] | 36.9 | 79.4\nOurs (OA+Intr) | 38.1 | 81.4\n\n(b) Attribute Accuracy\nAlgorithm | Av. I/U | Overall (% corr)\nAHCRF [9] | 17.4 | 95.1\nDenseCRF [12] | 18.28 | 96.2\nOurs (OA+Intr) | 18.85 | 96.7\n\nTable 2: Quantitative results on aPascal dataset for both the object segmentation (a), and attributes\nsegmentation (b) tasks. The table compares performance of our approach (last line) against two\nbaselines. The importance of our joint estimation for intrinsic images, objects and attributes is\nconfirmed by the better performance of our algorithm compared to the graph-cuts based (AHCRF)\nmethod [9] and mean-field based approach [12] for both the tasks. Here intersection vs. union (I/U)\nis defined as TP/(TP+FN+FP) and \u2019% corr\u2019 as the total proportion of correctly labelled pixels.\n\n[Figure 3 panels as listed in the caption; legends: NYU object-color coding, attribute-color coding.]\n\nFigure 3: Qualitative results on aNYU (first 2 lines) and aPascal (last line) dataset. From left to\nright: input image, reflectance, depth images, ground truth, output from [9] (AHCRF), output from\n[12], our output for the object segmentation. Last column shows our attribute segmentation output.\n(Attributes for NYU dataset: wood, painted, cotton, glass, brick, plastic, shiny, dirty; Attributes for\nPascal dataset: skin, metal, plastic, wood, cloth, glass, shiny.)\n\n6 Discussion and Conclusion\nIn this work, we have explored the synergy effects between the intrinsic properties of an image, and\nthe objects and attributes present in the scene. 
We cast the problem in a joint energy minimization framework; thus our model is able to encode the strong correlations between intrinsic properties (reflectance, shape, illumination), objects (table, tv-monitor), and materials (wooden, plastic) in a given scene. We have shown that dual-decomposition based techniques can be effectively applied to perform optimization in the joint model. We demonstrated its applicability on the extended versions of the NYU and Pascal datasets. We achieve both qualitative and quantitative improvements for the object and attribute labelling, and qualitative improvement for the intrinsic image estimation.\nFuture directions include further exploration of the possibilities of integrating priors based on structural attributes, such as slanted or cylindrical, into the joint intrinsic properties, objects and attributes model. For instance, knowledge that the object is slanted would provide a prior for the depth and the distribution of the surface normals. Further, incorporating a mixture of illumination models to better model the illumination in a natural scene remains another future direction.\nAcknowledgements. This work was supported by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. P.H.S. Torr is in receipt of a Royal Society Wolfson Research Merit Award.\nReferences\n[1] Barron, J.T. & Malik, J. (2012) Shape, albedo, and illumination from a single image of an unknown object. In IEEE CVPR, pp. 334-341. Providence, USA.\n[2] Barron, J.T. & Malik, J. (2012) Color constancy, intrinsic images, and shape estimation. In ECCV, pp. 57-70. Florence, Italy.\n[3] Barron, J.T. & Malik, J. (2012) High-frequency shape and albedo from shading using natural image statistics. In IEEE CVPR, pp. 2521-2528. CO, USA.\n[4] Barron, J.T. & Malik, J. (2013) Intrinsic scene properties from a single RGB-D image.
In IEEE CVPR.\n[5] Gehler, P.V., Rother, C., Kiefel, M., Zhang, L. & Sch\u00f6lkopf, B. (2011) Recovering intrinsic images with a global sparsity prior on reflectance. In NIPS, pp. 765-773. Granada, Spain.\n[6] Farhadi, A., Endres, I., Hoiem, D. & Forsyth, D.A. (2009) Describing objects by their attributes. In IEEE CVPR, pp. 1778-1785. Miami, USA.\n[7] Kohli, P., Kumar, M.P. & Torr, P.H.S. (2009) P\u00b3 & beyond: move making algorithms for solving higher order functions. In IEEE PAMI, pp. 1645-1656.\n[8] Ladicky, L., Sturgess, P., Russell, C., Sengupta, S., Bastanlar, Y., Clocksin, W.F. & Torr, P.H.S. (2012) Joint optimization for object class segmentation and dense stereo reconstruction. In IJCV, pp. 739-746.\n[9] Ladicky, L., Russell, C., Kohli, P. & Torr, P.H.S. (2009) Associative hierarchical CRFs for object class image segmentation. In IEEE ICCV, pp. 739-746. Kyoto, Japan.\n[10] Sloan, P.P., Kautz, J. & Snyder, J. (2002) Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In SIGGRAPH, pp. 527-536.\n[11] Vineet, V., Warrell, J. & Torr, P.H.S. (2012) Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. In ECCV, pp. 31-44. Florence, Italy.\n[12] Kr\u00e4henb\u00fchl, P. & Koltun, V. (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pp. 109-117. Granada, Spain.\n[13] Barrow, H.G. & Tenenbaum, J.M. (1978) Recovering intrinsic scene characteristics from images. In A. Hanson and E. Riseman, editors, Computer Vision Systems, pp. 3-26. Academic Press.\n[14] Weijer, J.V.d., Schmid, C. & Verbeek, J. (2007) Using high-level visual information for color constancy. In IEEE ICCV, pp. 1-8.\n[15] Liu, C., Sharan, L., Adelson, E.H. & Rosenholtz, R. (2010) Exploring features in a Bayesian framework for material recognition. In IEEE CVPR, pp. 239-246.\n[16] Horn, B.K.P.
(1970) Shape from shading: a method for obtaining the shape of a smooth opaque object from one view. Technical Report, MIT.\n[17] Land, E.H. & McCann, J.J. (1971) Lightness and retinex theory. In JOSA.\n[18] Osadchy, M., Jacobs, D.W. & Ramamoorthi, R. (2003) Using specularities for recognition. In IEEE ICCV.\n[19] Adelson, E.H. (2000) Lightness perception and lightness illusions. The New Cognitive Neuroscience, 2nd Ed. MIT Press, pp. 339-351.\n[20] Adelson, E.H. (2001) On seeing stuff: the perception of materials by humans and machines. SPIE, vol. 4299, pp. 1-12.\n[21] Felzenszwalb, P.F. & Huttenlocher, D.P. (2004) Efficient graph-based image segmentation. In IJCV.\n[22] Shotton, J., Winn, J., Rother, C. & Criminisi, A. (2003) TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. In IJCV.\n[23] Tighe, J. & Lazebnik, S. (2011) Understanding scenes on many levels. In IEEE ICCV, pp. 335-342.\n[24] LeCun, Y., Huang, F.J. & Bottou, L. (2004) Learning methods for generic object recognition with invariance to pose and lighting. In IEEE CVPR, pp. 97-104.\n[25] Silberman, N., Hoiem, D., Kohli, P. & Fergus, R. (2012) Indoor segmentation and support inference from RGBD images. In ECCV, pp. 746-760.\n[26] Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M. & Zisserman, A. (2010) The pascal visual object classes (VOC) challenge. In IJCV, pp. 303-338.\n[27] Cheng, M.M., Zheng, S., Lin, W.Y., Warrell, J., Vineet, V., Sturgess, P., Mitra, N., Crook, N. & Torr, P.H.S. (2013) ImageSpirit: Verbal Guided Image Parsing. Oxford Brookes Technical Report.\n[28] Dinh, Q.T., Necoara, I. & Diehl, M. (2013) Fast Inexact Decomposition Algorithms for Large-Scale Separable Convex Optimization. In JOTA.\n[29] Kohli, P., Ladicky, L. & Torr, P.H.S. (2008) on.
In IEEE CVPR, 2008.", "award": [], "sourceid": 354, "authors": [{"given_name": "Vibhav", "family_name": "Vineet", "institution": "Oxford Brookes University"}, {"given_name": "Carsten", "family_name": "Rother", "institution": "TU Dresden"}, {"given_name": "Philip", "family_name": "Torr", "institution": "University of Oxford"}]}