{"title": "Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs", "book": "Advances in Neural Information Processing Systems", "page_first": 1520, "page_last": 1528, "abstract": "The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer's output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood model. Representations and algorithms from computer graphics, originally designed to produce high-quality images, are instead used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on general-purpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured alphanumeric characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and supports accurate, approximately Bayesian inferences about ambiguous real-world images.", "full_text": "Approximate Bayesian Image Interpretation using\n\nGenerative Probabilistic Graphics Programs\n\nVikash K. Mansinghka\u21e4 1,2, Tejas D. Kulkarni\u21e4 1,2, Yura N. Perov1,2,3, and Joshua B. 
Tenenbaum1,2\n\n1Computer Science and Arti\ufb01cial Intelligence Laboratory, MIT\n\n2Department of Brain and Cognitive Sciences, MIT\n\n3Institute of Mathematics and Computer Science, Siberian Federal University\n\nAbstract\n\nThe idea of computer vision as the Bayesian inverse problem to computer graphics\nhas a long history and an appealing elegance, but it has proved dif\ufb01cult to directly\nimplement.\nInstead, most vision tasks are approached via complex bottom-up\nprocessing pipelines. Here we show that it is possible to write short, simple prob-\nabilistic graphics programs that de\ufb01ne \ufb02exible generative models and to automati-\ncally invert them to interpret real-world images. Generative probabilistic graphics\nprograms (GPGP) consist of a stochastic scene generator, a renderer based on\ngraphics software, a stochastic likelihood model linking the renderer\u2019s output and\nthe data, and latent variables that adjust the \ufb01delity of the renderer and the toler-\nance of the likelihood. Representations and algorithms from computer graphics\nare used as the deterministic backbone for highly approximate and stochastic gen-\nerative models. This formulation combines probabilistic programming, computer\ngraphics, and approximate Bayesian computation, and depends only on general-\npurpose, automatic inference techniques. We describe two applications: read-\ning sequences of degraded and adversarially obscured characters, and inferring\n3D road models from vehicle-mounted camera images. Each of the probabilistic\ngraphics programs we present relies on under 20 lines of probabilistic code, and\nyields accurate, approximately Bayesian inferences about real-world images.\n\n1\n\nIntroduction\n\nComputer vision has historically been formulated as the problem of producing symbolic descriptions\nof scenes from input images [10]. 
This is usually done by building bottom-up processing pipelines\nthat isolate the portions of the image associated with each scene element and extract features that\nsignal its identity. Many pattern recognition and learning techniques can then be used to build\nclassifiers for individual scene elements, and sometimes to learn the features themselves [11, 7].\nThis approach has been remarkably successful, especially on problems of recognition. Bottom-up\npipelines that combine image processing and machine learning can identify written characters with\nhigh accuracy and recognize objects from large sets of possibilities. However, the resulting systems\ntypically require large training corpuses to achieve reasonable levels of accuracy, and are difficult\nboth to build and modify. For example, the Tesseract system [16] for optical character recognition\nis over 10,000 lines of C++. Small changes to the underlying assumptions frequently necessitate\nend-to-end retraining and/or redesign.\n\n* The first two authors contributed equally to this work.\n* (vkm, tejask, perov, jbt)@mit.edu. Project URL: http://probcomp.csail.mit.edu/gpgp/\n\nGenerative models for a range of image parsing tasks are also being explored [17, 4, 18, 22, 20].\nThese provide an appealing avenue for integrating top-down constraints with bottom-up processing,\nand provide an inspiration for the approach we take in this paper. But like traditional bottom-up\npipelines for vision, these approaches have relied on considerable problem-specific engineering,\nchiefly to design and/or learn custom inference strategies, such as MCMC proposals [18, 22] that\nincorporate bottom-up cues. Other combinations of top-down knowledge with bottom-up processing\nhave been remarkably powerful [9]. 
For example, [8] has shown that global, 3D geometric information\ncan significantly improve the performance of bottom-up object detectors.\nIn this paper, we propose a novel formulation of image interpretation problems, called generative\nprobabilistic graphics programming (GPGP). GPGP models share a common template: a stochastic scene\ngenerator, an approximate renderer based on existing graphics software, a highly stochastic likelihood\nmodel for comparing the renderer\u2019s output with the observed data, and latent variables that\ncontrol the fidelity of the renderer and the tolerance of the image likelihood. Our probabilistic\ngraphics programs are written in Venture, a probabilistic programming language descended from\nChurch [6]. Each model we introduce requires fewer than 20 lines of probabilistic code. The renderers\nand likelihoods for each are based on standard templates written as short Python programs.\nUnlike typical generative models for scene parsing, inverting our probabilistic graphics programs requires\nno custom inference algorithm design. Instead, we rely on the automatic Metropolis-Hastings\n(MH) transition operators provided by our probabilistic programming system. The approximations\nand stochasticity in our renderer, scene generator and likelihood models serve to implement a variant\nof approximate Bayesian computation [19, 12]. 
This combination can produce a kind of self-tuning\nanalogue of annealing that facilitates reliable convergence.\nTo the best of our knowledge, our GPGP framework is the first real-world image interpretation formulation\nto combine all of the following elements: probabilistic programming, automatic inference,\ncomputer graphics, and approximate Bayesian computation; this constitutes our main contribution.\nOur second contribution is to provide demonstrations of the efficacy of this approach on two image\ninterpretation problems: reading snippets of degraded and adversarially obscured alphanumeric\ncharacters, and inferring 3D road models from vehicle-mounted cameras. In both cases we quantitatively\nreport the accuracy of our approach on representative test datasets, as compared to standard\nbottom-up baselines that have been extensively engineered.\n\n2 Generative Probabilistic Graphics Programs and Approximate Bayesian Inference.\n\nGPGP defines generative models for images by combining four components. The first is a stochastic\nscene generator written as probabilistic code that makes random choices for the location and\nconfiguration of the main elements in the scene. The second is an approximate renderer based on\nexisting graphics software that maps a scene S and control variables X to an image $I_R = f(S, X)$.\nThe third is a stochastic likelihood model for image data $I_D$ that enables scoring of rendered scenes\ngiven the control variables. The fourth is a set of latent variables X that control the fidelity of the\nrenderer and/or the tolerance in the stochastic likelihood model. 
These components are described\nschematically in Figure 1.\nWe formulate image interpretation tasks in terms of sampling (approximately) from the posterior\ndistribution over scenes given the observed image:\n\n$$P(S \\mid I_D) \\propto \\int P(S)\\, P(X)\\, \\delta_{f(S,X)}(I_R)\\, P(I_D \\mid I_R, X)\\, dX$$\n\nWe perform inference over execution histories of our probabilistic graphics programs using a\nuniform mixture of generic, single-variable Metropolis-Hastings transitions, without any custom,\nbottom-up proposals. We first give a general description of the generative model and inference algorithm\ninduced by our probabilistic graphics programs; in later sections, we describe specific details\nfor each application.\n\n[Figure 1 diagram: Stochastic Scene Generator (S ~ P(S), X ~ P(X)); Approximate Renderer ($I_R = f(S, X)$); Data $I_D$; Stochastic Comparison ($P(I_D \\mid I_R, X)$).]\n\nFigure 1: An overview of the GPGP framework. Each of our models shares a common template: a\nstochastic scene generator which samples possible scenes S according to their prior, latent variables\nX that control the fidelity of the rendering and the tolerance of the model, an approximate renderer\n$f(S, X) \\to I_R$ based on existing graphics software, and a stochastic likelihood model $P(I_D \\mid I_R, X)$\nthat links observed and rendered images. A scene $S^*$ sampled from the scene generator according to\nP(S) could be rendered onto a single image $I_R^*$. This would be extremely unlikely to exactly match\nthe data $I_D^*$. Instead of requiring exact matches, our formulation can broaden the renderer\u2019s output\n$P(I_R \\mid S^*)$ and the image likelihood $P(I_D^* \\mid I_R)$ via the latent control variables X. Inference over X\nmediates the degree of smoothing in the posterior.\n\nLet $S = \\{S_i\\}$ be a decomposition of the scene S into parts $S_i$ with independent priors $P(S_i)$. For\nexample, in our text application, the $S_i$ include binary indicators for the presence or absence of each\nglyph, along with its identity (\u201cA\u201d through \u201cZ\u201d, plus digits 0-9), and parameters including location,\nsize and rotation. Also let $X = \\{X_j\\}$ be a decomposition of the control variables X into parts $X_j$\nwith priors $P(X_j)$, such as the bandwidths of per-glyph Gaussian spatial blur kernels, the variance\nof a Gaussian image likelihood, and so on. Our proposals modify single elements of the scene and\ncontrol variables at a time, as follows:\n\n$$P(S) = \\prod_i P(S_i), \\qquad q_i(S'_i, S_i) = P(S'_i)$$\n$$P(X) = \\prod_j P(X_j), \\qquad q_j(X'_j, X_j) = P(X'_j)$$\n\nNow let $K = |\\{S_i\\}| + |\\{X_j\\}|$ be the total number of random variables in each execution. For\nsimplicity, we describe the case where this number can be bounded above beforehand, i.e. total\na priori scene complexity is limited. At each inference step, we choose a random variable index\nk < K uniformly at random. If k corresponds to a scene variable i, then we propose from $q_i(S'_i, S_i)$,\nso our overall proposal kernel is $q((S, X) \\to (S', X')) = \\delta_{S_{-i}}(S'_{-i})\\, P(S'_i)\\, \\delta_X(X')$. If k corresponds\nto a control variable j, we propose from $q_j(X'_j, X_j)$. In both cases we re-render the scene $I'_R = f(S', X')$.\nWe then run the kernel associated with this variable, and accept or reject via the MH\nequation:\n\n$$\\alpha_{MH}((S, X) \\to (S', X')) = \\min\\left(1, \\frac{P(I_D \\mid f(S', X'), X')\\, P(S')\\, P(X')\\, q((S', X') \\to (S, X))}{P(I_D \\mid f(S, X), X)\\, P(S)\\, P(X)\\, q((S, X) \\to (S', X'))}\\right)$$\n\nWe implement our probabilistic graphics programs in the Venture probabilistic programming language.\nThe Metropolis-Hastings inference algorithm we use is provided by default in this system;\nno custom inference code is required. In the context of our GPGP formulation, this algorithm makes\nimplicit use of ideas from approximate Bayesian computation (ABC). ABC methods approximate\nBayesian inference over complex generative processes by using an exogenous distance function\nto compare sampled outputs with observed data. In the original rejection sampling formulation,\nsamples are accepted only if they match the data within a hard threshold. 
Subsequently, combinations\nof ABC and MCMC were proposed [12], including variants with inference over the threshold\nvalue [15]. Most recently, extensions have been introduced where the hard cutoff is replaced with\na stochastic likelihood model [19]. Our formulation incorporates a combination of these insights:\nrendered scenes are only approximately constrained to match the observed image, with the tightness\nof the match mediated by inference over factors such as the fidelity of the rendering and the\nstochasticity in the likelihood. This allows image variability that is unnecessary or even undesirable\nto model to be treated in a principled fashion.\n\nFigure 2: Four input images from our CAPTCHA corpus, along with the final results and convergence\ntrajectory of typical inference runs. The first row is a highly cluttered synthetic CAPTCHA\nexhibiting extreme letter overlap. The second row is a CAPTCHA from TurboTax, the third row\nis a CAPTCHA from AOL, and the fourth row shows an example where our system makes errors\non some runs. Our probabilistic graphics program did not originally support rotation, which was\nneeded for the AOL CAPTCHAs; adding it required only 1 additional line of probabilistic code. See\nthe main text for quantitative details, and supplemental material for the full corpus.\n\n3 Generative Probabilistic Graphics in 2D for Reading Degraded Text.\nWe developed a probabilistic graphics program for reading short snippets of degraded text consisting\nof arbitrary digits and letters. See Figure 2 for representative inputs and outputs. 
In this program,\nthe latent scene $S = \\{S_i\\}$ contains a bank of variables for each glyph, including whether a potential\nletter is present or absent from the scene, what its spatial coordinates and size are, what its identity\nis, and how it is rotated:\n\n$$P(S_i^{\\mathrm{pres}} = 1) = 0.5$$\n$$P(S_i^{x} = x) = \\begin{cases} 1/w & 0 \\le x \\le w \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad P(S_i^{y} = y) = \\begin{cases} 1/h & 0 \\le y \\le h \\\\ 0 & \\text{otherwise} \\end{cases}$$\n$$P(S_i^{\\mathrm{glyph\\,id}} = g) = \\begin{cases} 1/G & 0 \\le g < G \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad P(S_i^{\\theta} = \\theta) = \\begin{cases} 1/(2\\theta_{\\max}) & -\\theta_{\\max} \\le \\theta < \\theta_{\\max} \\\\ 0 & \\text{otherwise} \\end{cases}$$\n\nOur renderer rasterizes each letter independently, applies a spatial blur to each image, composites\nthe letters, and then blurs the result. We also applied global blur to the original training image\nbefore applying the stochastic likelihood model on the blurred original and rendered images. The\nstochastic likelihood model is a multivariate Gaussian whose mean is the blurry rendering; formally,\n$I_D \\sim N(I_R; \\sigma)$. The control variables $X = \\{X_j\\}$ for the renderer and likelihood consist of per-letter\nGaussian spatial blur bandwidths $X_j^i \\sim \\alpha \\cdot \\mathrm{Beta}(1, 2)$, a global image blur on the rendered\nimage $X_{\\mathrm{blur\\_rendered}} \\sim \\beta \\cdot \\mathrm{Beta}(1, 2)$, a global image blur on the original test image\n$X_{\\mathrm{blur\\_test}} \\sim \\gamma \\cdot \\mathrm{Beta}(1, 2)$, and the standard deviation of the Gaussian likelihood $\\sigma \\sim \\mathrm{Gamma}(1, 1)$\n(with $\\alpha$, $\\beta$ and $\\gamma$ set to favor small bandwidths). To make hard classification decisions, we use the sample\nwith lowest pixel reconstruction error from a set of 5 approximate posterior samples. We also\nexperimented with enabling enumerative (griddy) Gibbs sampling for uniform discrete variables\nwith 10% probability. 
The probabilistic code for this model is shown in Figure 4.\nTo assess the accuracy of our approach on adversarially obscured text, we developed a corpus consisting\nof over 40 images from widely used websites such as TurboTax, E-Trade, and AOL, plus\nadditional challenging synthetic CAPTCHAs with high degrees of letter overlap and superimposed\ndistractors. Each source of text violates the underlying assumptions of our probabilistic graphics\nprogram in different ways. TurboTax CAPTCHAs incorporate occlusions that break strokes within\nletters, while AOL CAPTCHAs include per-letter warping. These CAPTCHAs all involve arbitrary\ndigits and letters, and as a result lack cues from word identity that the best published CAPTCHA\nbreaking systems depend on [13]. The dynamically adjustable fidelity of our approximate renderer\nand the high stochasticity of our generative model appear to be necessary for inference to robustly\nescape local minima. We have observed a kind of self-tuning annealing resulting from inference\nover the control variables; see Figure 3 for an illustration. We observe robust character recognition\ngiven enough inference, with an overall character detection rate of 70.6%. To calibrate the difficulty\nof our corpus, we also ran the Tesseract optical character recognition engine [16] on our corpus; its\ncharacter detection rate was 37.7%.\n\nFigure 3: Inference over renderer fidelity significantly improves the reliability of inference. (a) Reconstruction\nerrors for 5 runs of two variants of our probabilistic graphics program for text. Without\nsufficient stochasticity and approximation in the generative model (that is, with a strong prior over\na purely deterministic, high-fidelity renderer), inference gets stuck in local energy minima (red\nlines). With inference over renderer fidelity via per-letter and global blur, the tolerance of the image\nlikelihood, and the number of letters, convergence improves substantially (blue lines). Many local\nminima in the likelihood are escaped over the course of single-variable inference, and the blur variables\nare automatically adjusted to support localizing and identifying letters. (b) Clockwise from\ntop left: an input CAPTCHA, two typical local minima, and one correct parse. (c,d,e,f) A representative\nrun, illustrating the convergence dynamics that result from inference over the renderer\u2019s\nfidelity. From left to right, we show overall log probability, pixel-wise disagreement (many local\nminima are escaped over the course of inference), the number of active letters in the scene, and the\nper-letter blur variables. Inference automatically adjusts blur so that newly proposed letters are often\nblurred out until they are localized and identified accurately.\n\n4 Generative Probabilistic Graphics in 3D: Road Finding.\nWe have also developed a generative probabilistic graphics program for localizing roads in 3D from\nsingle images. This is an important problem in autonomous driving. As with many perception\nproblems in robotics, there is clear scene structure to exploit, but also considerable uncertainty\nabout the scene, as well as substantial image-to-image variability that needs to be robustly ignored.\nSee Figure 5b for example inputs.\nThe probabilistic graphics program we use for this problem is shown in Figure 7. The latent\nscene S comprises the height of the roadway from the ground plane, the road\u2019s width and\nlane size, and the 3D offset of the corner of the road from the (arbitrary) camera location. The\nprior encodes the assumption that the lanes are small relative to the road, and that the road has two\nlanes and is very likely to be visible (but may not be centered). 
This scene is then rendered to\nproduce a surface-based segmentation image that assigns each input pixel to one of 4 regions\n$r \\in R = \\{\\text{left offroad}, \\text{right offroad}, \\text{road}, \\text{lane}\\}$. Rendering is done for each scene element separately,\nfollowed by compositing, as with our 2D text program. See Figure 5a for random surface-based\nsegmentation images drawn from this prior. Extensions to richer road and ground geometries\nare an interesting direction for future work. This model is similar in spirit to [1] but the key difference\nis that our framework relies on automatic inference techniques, is representationally richer due\nto compact model description and goes beyond point estimates to report posterior uncertainty.\n\nASSUME is_present (mem (lambda (id) (bernoulli 0.5)))\nASSUME pos_x (mem (lambda (id) (uniform_discrete 0 200)))\nASSUME pos_y (mem (lambda (id) (uniform_discrete 0 200)))\nASSUME size_x (mem (lambda (id) (uniform_discrete 0 100)))\nASSUME size_y (mem (lambda (id) (uniform_discrete 0 100)))\nASSUME rotation (mem (lambda (id) (uniform_continuous -20.0 20.0)))\nASSUME glyph (mem (lambda (id) (uniform_discrete 0 35))) // 26 + 10.\nASSUME blur (mem (lambda (id) (* 7 (beta 1 2))))\nASSUME global_blur (* 7 (beta 1 2))\nASSUME data_blur (* 7 (beta 1 2))\nASSUME epsilon (gamma 1 1)\nASSUME data (load_image \"captcha_1.png\" data_blur)\nASSUME image (render_surfaces max-num-glyphs global_blur\n(pos_x 1) (pos_y 1) (glyph 1) (size_x 1) (size_y 1) (rotation 1) (blur 1)\n(is_present 1) (pos_x 2) (pos_y 2) (glyph 2) (size_x 2) (size_y 2)\n(rotation 2) (blur 2) (is_present 2) ... (is_present 10))\nOBSERVE (incorporate_stochastic_likelihood data image epsilon) True\n\nFigure 4: A generative probabilistic graphics program for reading degraded text. The scene generator\nchooses letter identity (A-Z and digits 0-9), position, size and rotation at random. These random\nvariables are fed into the renderer, along with the bandwidths of a series of spatial blur kernels (one\nper letter, another for the overall rendered image from the generative model and another for the original\ninput image). These blur kernels control the fidelity of the rendered image. The image returned by\nthe renderer is compared to the data via a pixel-wise Gaussian likelihood model, whose variance is\nalso an unknown variable.\n\nIn our experiments, we used k-means (with k = 20) to cluster RGB values from a randomly chosen\ntraining image. We used these clusters to build a compact appearance model based on cluster-center\nhistograms, by assigning test image pixels to their nearest cluster. However, we are agnostic to\nthe particular choice of the appearance model and many feature engineering and feature learning\ntechniques can be substituted here without loss of generality. Our stochastic likelihood incorporates\nthese histograms, by multiplying together the appearance probabilities for each image region\n$r \\in R$. These probabilities, denoted $\\vec{\\theta}_r$, are smoothed by pseudo-counts $\\epsilon$ drawn from a Gamma\ndistribution. Let $Z_r$ be the per-region normalizing constant, and $I_D(x,y)$ be the quantized pixel at\ncoordinates (x, y) in the input image. Then our likelihood model is:\n\n$$P(I_D \\mid I_R, \\epsilon) = \\prod_{r \\in R} \\; \\prod_{x,y \\ \\text{s.t.}\\ I_R(x,y) = r} \\frac{\\theta_r^{I_D(x,y)} + \\epsilon}{Z_r}$$\n\nFigure 5f shows appearance model histograms from one random training frame. Figure 5c shows\nthe extremely noisy lane/non-lane classifications that result from the appearance model on its own,\nwithout our scene prior; accuracy is extremely low. Other, richer appearance models, such as Gaussian\nmixtures over RGB values (which could be either hand specified or learned), are compatible\nwith our formulation; our simple, quantized model was chosen primarily for simplicity. 
We use the\nsame generic Metropolis-Hastings strategy for inference in this problem as in our text application.\nAlthough deterministic search strategies for MAP inference could be developed for this particular\nprogram, it is less clear how to build a single deterministic search algorithm that could work on both\nof the generative probabilistic graphics programs we present.\nIn Table 1, we report the accuracy of our approach on one road dataset from the KITTI Vision\nBenchmark Suite [5]. To focus on accuracy in the face of visual variability, we do not exploit temporal\ncorrespondences. We test on every 5th frame for a total of 80 frames. We report lane/non-lane accuracy\nresults for maximum likelihood classification over 10 appearance models (from 10 randomly chosen\ntraining images), as well as for the single best appearance model from this set. We use 10 posterior\nsamples per frame for both. For reference, we include the performance of a sophisticated bottom-up\nbaseline system from [2]. This baseline system requires significant 3D a priori knowledge, including\nthe intrinsic and extrinsic parameters of the camera, and a rough initial segmentation of each test\nimage. In contrast, our approach has to infer these aspects of the scene from the image data. We\nalso show some uncertainty estimates that result from approximate Bayesian inference in Figure 6.\nOur probabilistic graphics program for this problem requires under 20 lines of probabilistic code.\n\nFigure 5: An illustration of generative probabilistic graphics for 3D road finding. (a) Renderings\nof random samples from our scene prior, showing the surface-based image segmentation induced\nby each sample. (b) Representative test frames from the KITTI dataset [5]. (c) Maximum likelihood\nlane/non-lane classification of the images from (b) based solely on the best-performing\nsingle-training-frame appearance model (ignoring latent geometry). Geometric constraints are clearly\nneeded for reliable road finding. (d) Results from [2]. (e) Typical inference results from the proposed\ngenerative probabilistic graphics approach on the images from (b). (f) Appearance model histograms\n(over quantized RGB values) from the best-performing single-training-frame appearance\nmodel for all four region types: lane, left offroad, right offroad and road.\n\n(a) Lanes superimposed from 30 scenes sampled from our prior\n(b) 30 posterior samples on a low accuracy (Frame 199), high uncertainty frame\n(c) 30 posterior samples on a high accuracy (Frame 384), low uncertainty frame\n(d) Posterior samples of left lane position for both frames\n\nFigure 6: Approximate Bayesian inference yields samples from a broad, multimodal scene posterior\non a frame that violates our modeling assumptions (note the intersection), but reports less uncertainty\non a frame more compatible with our model (with perceptually reasonable alternatives to the mode).\n\nASSUME road_width (uniform_discrete 5 8) //arbitrary units\nASSUME road_height (uniform_discrete 70 150)\nASSUME lane_pos_x (uniform_continuous -1.0 1.0) //uncentered renderer\nASSUME lane_pos_y (uniform_continuous -5.0 0.0) //coordinate system\nASSUME lane_pos_z (uniform_continuous 1.0 3.5)\nASSUME lane_size (uniform_continuous 0.10 0.35)\nASSUME eps (gamma 1 1)\nASSUME theta_left (list 0.13 ... 0.03)\nASSUME theta_right (list 0.03 ... 0.02)\nASSUME theta_road (list 0.05 ... 0.07)\nASSUME theta_lane (list 0.01 ... 0.21)\nASSUME data (load_image \"frame201.png\")\nASSUME surfaces (render_surfaces lane_pos_x lane_pos_y lane_pos_z\n    road_width road_height lane_size)\nOBSERVE (incorporate_stochastic_likelihood theta_left theta_right\n    theta_road theta_lane data surfaces eps) True\n\nFigure 7: Source code for a generative probabilistic graphics program that infers 3D road models.\n\nMethod | Accuracy\nAly et al [2] | 68.31%\nGPGP (Best Single Appearance) | 64.56%\nGPGP (Maximum Likelihood over Multiple Appearances) | 74.60%\n\nTable 1: Quantitative results for lane detection accuracy on one of the road datasets in the KITTI\nVision Benchmark Suite [5]. See main text for details.\n\n5 Discussion\nWe have shown that it is possible to write short probabilistic graphics programs that use simple\n2D and 3D computer graphics techniques as the backbone for highly approximate generative models.\nApproximate Bayesian inference over the execution histories of these probabilistic graphics\nprograms (automatically implemented via generic, single-variable Metropolis-Hastings transitions,\nusing existing rendering libraries and simple likelihoods) then implements a new variation\non analysis by synthesis [21]. We have also shown that this approach can yield accurate, globally\nconsistent interpretations of real-world images, and can coherently report posterior uncertainty over\nlatent scenes when appropriate. Our core contributions are the introduction of this conceptual framework\nand two initial demonstrations of its efficacy.\nTo scale our inference approach to handle more complex scenes, it will likely be important to consider\nmore complex forms of automatic inference, beyond the single-variable Metropolis-Hastings\nproposals we currently use. For example, discriminatively trained proposals could help, and in fact\ncould be trained based on forward executions of the probabilistic graphics program. Appearance\nmodels derived from modern image features and texture descriptors [14, 7, 11] (going beyond the\nsimple quantizations we currently use) could also reduce the burden on inference and improve the\ngeneralizability of individual programs. 
It is important to note that the high dimensionality involved\nin probabilistic graphics programming does not necessarily mean inference (and even automatic inference)\nis impossible. For example, approximate inference in models with probabilities bounded\naway from 0 and 1 can sometimes be provably tractable via sampling techniques, with runtimes that\ndepend on factors other than dimensionality [3]. Exploring the role of stochasticity in facilitating\ntractability is an important avenue for future work.\nThe most interesting potential of GPGP lies in bringing graphics representations and algorithms\nto bear on the hard modeling and inference problems in vision. For example, to avoid global re-rendering\nafter each inference step, we need to represent and exploit the conditional independencies\nbetween latent scene elements and image regions. Inference in GPGP based on a z-buffer or a layered\ncompositor could potentially do this. We hope the GPGP framework facilitates image analysis\nby Bayesian inversion of rich graphics algorithms for scene generation and image synthesis.\n\nAcknowledgments\nWe are grateful to K. Bonawitz and E. Jonas for preliminary work on CAPTCHA breaking, and to S.\nTeller, B. Freeman, T. Adelson, M. James, M. Siegel and anonymous reviewers for helpful feedback\nand discussions. T. Kulkarni was graciously supported by the Henry E Singleton (1940) Fellowship.\nThis research was supported by ONR award N000141310333, ARO MURI W911NF-13-1-2012,\nthe DARPA UPSIDE program and a gift from Google.\n\nReferences\n\n[1] Jos\u00e9 Manuel \u00c1lvarez, Theo Gevers, and Antonio M Lopez. \u201c3D scene priors for road detection\u201d. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE. 2010, pp. 57\u201364.\n\n[2] Mohamed Aly. \u201cReal time detection of lane markers in urban streets\u201d. In: Intelligent Vehicles Symposium, 2008 IEEE. IEEE. 2008, pp. 7\u201312.\n\n[3] Paul Dagum and Michael Luby. 
\u201cAn optimal approximation algorithm for Bayesian inference\u201d. In: Artificial Intelligence 93.1 (1997), pp. 1\u201327.\n\n[4] L Del Pero et al. \u201cBayesian geometric modeling of indoor scenes\u201d. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE. 2012, pp. 2719\u20132726.\n\n[5] Andreas Geiger, Philip Lenz, and Raquel Urtasun. \u201cAre we ready for autonomous driving? The KITTI vision benchmark suite\u201d. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE. 2012, pp. 3354\u20133361.\n\n[6] Noah Goodman, Vikash Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua Tenenbaum. \u201cChurch: A language for generative models\u201d. In: UAI. 2008.\n\n[7] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. \u201cA fast learning algorithm for deep belief nets\u201d. In: Neural computation 18.7 (2006), pp. 1527\u20131554.\n\n[8] Derek Hoiem, Alexei A Efros, and Martial Hebert. \u201cPutting objects in perspective\u201d. In: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE. 2006, pp. 2137\u20132144.\n\n[9] Derek Hoiem, Alexei A Efros, and Martial Hebert. \u201cRecovering surface layout from an image\u201d. In: International Journal of Computer Vision 75.1 (2007), pp. 151\u2013172.\n\n[10] Berthold Klaus Paul Horn. Robot vision. The MIT Press, 1986.\n\n[11] Yann LeCun and Yoshua Bengio. \u201cConvolutional networks for images, speech, and time series\u201d. In: The handbook of brain theory and neural networks 3361 (1995).\n\n[12] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavar\u00e9. \u201cMarkov chain Monte Carlo without likelihoods\u201d. In: Proceedings of the National Academy of Sciences 100.26 (2003).\n\n[13] Greg Mori and Jitendra Malik. \u201cRecognizing objects in adversarial clutter: Breaking a visual CAPTCHA\u201d. In: Computer Vision and Pattern Recognition, 2003. Proceedings. 
2003 IEEE Computer Society Conference on. Vol. 1. IEEE. 2003, pp. I\u2013134.\n\n[14] Javier Portilla and Eero P Simoncelli. \u201cA parametric texture model based on joint statistics of complex wavelet coefficients\u201d. In: International Journal of Computer Vision 40.1 (2000).\n\n[15] Oliver Ratmann, Christophe Andrieu, Carsten Wiuf, and Sylvia Richardson. \u201cModel criticism based on likelihood-free inference, with an application to protein network evolution\u201d. In: 106.26 (2009), pp. 10576\u201310581.\n\n[16] Ray Smith. \u201cAn overview of the Tesseract OCR engine\u201d. In: Ninth International Conference on Document Analysis and Recognition. Vol. 2. IEEE. 2007, pp. 629\u2013633.\n\n[17] Zhuowen Tu, Xiangrong Chen, Alan L Yuille, and Song-Chun Zhu. \u201cImage parsing: Unifying segmentation, detection, and recognition\u201d. In: International Journal of Computer Vision 63.2 (2005), pp. 113\u2013140.\n\n[18] Zhuowen Tu and Song-Chun Zhu. \u201cImage Segmentation by Data-Driven Markov Chain Monte Carlo\u201d. In: IEEE Trans. Pattern Anal. Mach. Intell. 24.5 (May 2002).\n\n[19] Richard D Wilkinson. \u201cApproximate Bayesian computation (ABC) gives exact results under the assumption of model error\u201d. In: arXiv preprint arXiv:0811.3355 (2008).\n\n[20] David Wingate, Noah D Goodman, A Stuhlmueller, and J Siskind. \u201cNonstandard interpretations of probabilistic programs for efficient inference\u201d. In: Advances in Neural Information Processing Systems 23 (2011).\n\n[21] Alan Yuille and Daniel Kersten. \u201cVision as Bayesian inference: analysis by synthesis?\u201d In: Trends in cognitive sciences 10.7 (2006), pp. 301\u2013308.\n\n[22] Yibiao Zhao and Song-Chun Zhu. \u201cImage Parsing via Stochastic Scene Grammar\u201d. In: Advances in Neural Information Processing Systems. 
2011.", "award": [], "sourceid": 759, "authors": [{"given_name": "Vikash", "family_name": "Mansinghka", "institution": "MIT"}, {"given_name": "Tejas", "family_name": "Kulkarni", "institution": "MIT"}, {"given_name": "Yura", "family_name": "Perov", "institution": "MIT"}, {"given_name": "Josh", "family_name": "Tenenbaum", "institution": "MIT"}]}