{"title": "One-shot learning by inverting a compositional causal process", "book": "Advances in Neural Information Processing Systems", "page_first": 2526, "page_last": 2534, "abstract": "People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also used a visual Turing test\" to show that our model produces human-like performance on other conceptual tasks, including generating new examples and parsing.\"", "full_text": "One-shot learning by inverting a compositional causal\n\nprocess\n\nBrenden M. Lake\n\nRuslan Salakhutdinov\n\nDept. of Brain and Cognitive Sciences\n\nDept. of Statistics and Computer Science\n\nMIT\n\nbrenden@mit.edu\n\nUniversity of Toronto\n\nrsalakhu@cs.toronto.edu\n\nJoshua B. Tenenbaum\n\nDept. of Brain and Cognitive Sciences\n\nMIT\n\njbt@mit.edu\n\nAbstract\n\nPeople can learn a new visual class from just one example, yet machine learn-\ning algorithms typically require hundreds or thousands of examples to tackle the\nsame problems. Here we present a Hierarchical Bayesian model based on com-\npositionality and causality that can learn a wide range of natural (although sim-\nple) visual concepts, generalizing in human-like ways from just one image. We\nevaluated performance on a challenging one-shot classi\ufb01cation task, where our\nmodel achieved a human-level error rate while substantially outperforming two\ndeep learning models. 
We also tested the model on another conceptual task, generating new examples, by using a "visual Turing test" to show that our model produces human-like performance.

1 Introduction

People can acquire a new concept from only the barest of experience – just one or a handful of examples in a high-dimensional space of raw perceptual input. Although machine learning has tackled some of the same classification and recognition problems that people solve so effortlessly, the standard algorithms require hundreds or thousands of examples to reach good performance. While the standard MNIST benchmark dataset for digit recognition has 6000 training examples per class [19], people can classify new images of a foreign handwritten character from just one example (Figure 1b) [23, 16, 17]. Similarly, while classifiers are generally trained on hundreds of images per class, using benchmark datasets such as ImageNet [4] and CIFAR-10/100 [14], people can learn a new visual object from just one example (e.g., a "Segway" in Figure 1a).

Figure 1: Can you learn a new concept from just one example? (a & b) Where are the other examples of the concept shown in red? Answers for b) are row 4 column 3 (left) and row 2 column 4 (right). c) The learned concepts also support many other abilities such as generating examples and parsing.

Figure 2: Four alphabets from Omniglot, each with five characters drawn by four different people.
These new larger datasets have developed along with larger and "deeper" model architectures, and while performance has steadily (and even spectacularly [15]) improved in this big data setting, it is unknown how this progress translates to the "one-shot" setting that is a hallmark of human learning [3, 22, 28].

Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. Equipped with the concept "Segway" or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. While this flexibility highlights the richness of people's concepts, suggesting they are much more than discriminative features or rules, there are reasons to suspect that such sophisticated concepts would be difficult if not impossible to learn from very sparse data. Theoretical analyses of learning express a tradeoff between the complexity of the representation (or the size of its hypothesis space) and the number of examples needed to reach some measure of "good generalization" (e.g., the bias/variance dilemma [8]). Given that people seem to succeed at both sides of the tradeoff, a central challenge is to explain this remarkable ability: What types of representations can be learned from just one or a few examples, and how can these representations support such flexible generalizations?

To address these questions, our work here offers two contributions as initial steps. First, we introduce a new set of one-shot learning problems for which humans and machines can be compared side-by-side, and second, we introduce a new algorithm that does substantially better on these tasks than current algorithms.
We selected simple visual concepts from the domain of handwritten characters, which offers a large number of novel, high-dimensional, and cognitively natural stimuli (Figure 2). These characters are significantly more complex than the simple artificial stimuli most often modeled in psychological studies of concept learning (e.g., [6, 13]), yet they remain simple enough to hope that a computational model could see all the structure that people do, unlike domains such as natural scenes. We used a dataset we collected called "Omniglot" that was designed for studying learning from a few examples [17, 26]. While similar in spirit to MNIST, rather than having 10 characters with 6000 examples each, it has over 1,600 characters with 20 examples each – making it more like the "transpose" of MNIST. These characters were selected from 50 different alphabets on www.omniglot.com, which includes scripts from natural languages (e.g., Hebrew, Korean, Greek) and artificial scripts (e.g., Futurama and ULOG) invented for purposes like TV shows or video games. Since it was produced on Amazon's Mechanical Turk, each image is paired with a movie ([x, y, time] coordinates) showing how that drawing was produced.

In addition to introducing new one-shot learning challenge problems, this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example. We compared the model with people and other competitive computational models for character recognition, including Deep Boltzmann Machines [25] and their Hierarchical Deep extension for learning with very few examples [26]. We find that HBPL classifies new examples with near human-level accuracy, substantially beating the competing models.
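To make the "transpose of MNIST" structure concrete, here is a toy Python sketch of sampling one 20-way within-alphabet one-shot trial of the kind evaluated later in the paper. The data layout, names, and helper functions are illustrative assumptions, not the paper's code or the dataset's actual format.

```python
import random

def sample_one_shot_trial(alphabet, rng):
    """Sample one 20-way within-alphabet classification trial from a toy
    Omniglot-like structure (alphabet: character name -> list of drawings).
    Returns (test_image, true_index, training_images)."""
    chars = rng.sample(sorted(alphabet), 20)          # 20 new characters
    train = [rng.choice(alphabet[c]) for c in chars]  # one drawing of each
    target = rng.randrange(20)
    # the test image is a different drawing of the target character
    others = [d for d in alphabet[chars[target]] if d != train[target]]
    return rng.choice(others), target, train

# Toy alphabet: 25 characters with 20 placeholder "drawings" each.
toy_alphabet = {f"char{i}": [f"char{i}_drawing{j}" for j in range(20)]
                for i in range(25)}
test_img, target, train_imgs = sample_one_shot_trial(toy_alphabet,
                                                     random.Random(0))
assert len(train_imgs) == 20 and test_img not in train_imgs
```

Each trial thus supplies exactly one training example per candidate class, which is what makes the task "one shot."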
We also tested the model on generating new exemplars, another natural form of generalization, using a "visual Turing test" to evaluate performance. In this test, both people and the model performed the same task side by side, and then other human participants judged which result was from a person and which was from a machine.

2 Hierarchical Bayesian Program Learning

We introduce a new computational approach called Hierarchical Bayesian Program Learning (HBPL) that utilizes the principles of compositionality and causality to build a probabilistic generative model of handwritten characters. It is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes. Given the raw pixels, the model searches for a "structural description" to explain the image by freely combining these elementary parts and their spatial relations. Unlike classic structural description models [27, 2], HBPL also reflects abstract causal structure about how characters are actually produced.

Figure 3: An illustration of the HBPL model generating two character types (left and right), where the dotted line separates the type-level from the token-level variables. Legend: number of strokes κ, relations R, primitive id z (color-coded to highlight sharing), control points x (open circles), scale y, start locations L, trajectories T, transformation A, noise ε and σb, and image I.
This type of causal representation is psychologically plausible, and it has been previously theorized to explain both behavioral and neuro-imaging data regarding human character perception and learning (e.g., [7, 1, 21, 11, 12, 17]). As in most previous "analysis by synthesis" models of characters, strokes are not modeled at the level of muscle movements, so that they are abstract enough to be completed by a hand, a foot, or an airplane writing in the sky. But HBPL also learns a significantly more complex representation than earlier models, which used only one stroke (unless a second was added manually) [24, 10] or received on-line input data [9], sidestepping the challenging parsing problem needed to interpret complex characters.

The model distinguishes between character types (an 'A', 'B', etc.) and tokens (an 'A' drawn by a particular person), where types provide an abstract structural specification for generating different tokens. The joint distribution on types ψ, tokens θ(m), and binary images I(m) is given as follows,

P(ψ, θ(1), ..., θ(M), I(1), ..., I(M)) = P(ψ) ∏_{m=1}^{M} P(I(m) | θ(m)) P(θ(m) | ψ).   (1)

Pseudocode to generate from this distribution is shown in the Supporting Information (Section SI-1).

2.1 Generating a character type

A character type ψ = {κ, S, R} is defined by a set of κ strokes S = {S1, ..., Sκ} and spatial relations R = {R1, ..., Rκ} between strokes. The joint distribution can be written as

P(ψ) = P(κ) ∏_{i=1}^{κ} P(Si) P(Ri | S1, ..., Si−1).   (2)

The number of strokes is sampled from a multinomial P(κ) estimated from the empirical frequencies (Figure 4b), and the other conditional distributions are defined in the sections below.
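Eq. 2 factorizes the type-level prior sequentially. A minimal Python sketch of that skeleton follows; the multinomial over κ and the stroke/relation samplers are invented placeholders standing in for the learned components defined below, not the paper's actual parameters.

```python
import random

# Hypothetical empirical distribution over the number of strokes kappa
# (stands in for the multinomial P(kappa) estimated from background drawings).
STROKE_COUNT_PROBS = {1: 0.45, 2: 0.35, 3: 0.15, 4: 0.05}

def sample_kappa(rng):
    """Sample the number of strokes from the multinomial P(kappa)."""
    counts, probs = zip(*STROKE_COUNT_PROBS.items())
    return rng.choices(counts, weights=probs, k=1)[0]

def sample_type(rng, sample_stroke, sample_relation):
    """Sample a character type psi = {kappa, S, R} following Eq. 2:
    P(psi) = P(kappa) * prod_i P(S_i) P(R_i | S_1..S_{i-1})."""
    kappa = sample_kappa(rng)
    strokes, relations = [], []
    for i in range(kappa):
        strokes.append(sample_stroke(rng))               # P(S_i)
        relations.append(sample_relation(rng, strokes))  # P(R_i | S_1..S_{i-1})
    return {"kappa": kappa, "S": strokes, "R": relations}

rng = random.Random(0)
psi = sample_type(rng, lambda r: "stroke", lambda r, s: "relation")
assert psi["kappa"] == len(psi["S"]) == len(psi["R"])
```

The key structural point is that each relation sampler sees the strokes generated so far, mirroring the conditioning in Eq. 2.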
All hyperparameters, including the library of primitives (top of Figure 3), were learned from a large "background set" of character drawings as described in Sections 2.3 and SI-4.

Strokes. Each stroke is initiated by pressing the pen down and terminated by lifting the pen up. In between, a stroke is a motor routine composed of simple movements called sub-strokes Si = {si1, ..., sini} (colored curves in Figure 3), where sub-strokes are separated by brief pauses of the pen. Each sub-stroke sij is modeled as a uniform cubic b-spline, which can be decomposed into three variables sij = {zij, xij, yij} with joint distribution P(Si) = P(zi) ∏_{j=1}^{ni} P(xij | zij) P(yij | zij). The discrete class zij ∈ N is an index into the library of primitive motor elements (top of Figure 3), and its distribution P(zi) = P(zi1) ∏_{j=2}^{ni} P(zij | zi(j−1)) is a first-order Markov process that adds sub-strokes at each step until a special "stop" state is sampled that ends the stroke. The five control points xij ∈ R10 (small open circles in Figure 3) are sampled from a Gaussian P(xij | zij) = N(µzij, Σzij), but they live in an abstract space not yet embedded in the image frame. The type-level scale yij of this space, relative to the image frame, is sampled from P(yij | zij) = Gamma(αzij, βzij).

Relations. The spatial relation Ri specifies how the beginning of stroke Si connects to the previous strokes {S1, ..., Si−1}.
The distribution P(Ri | S1, ..., Si−1) = P(Ri | z1, ..., zi−1), since it only depends on the number of sub-strokes in each stroke. Relations can come in four types with probabilities θR, and each type has different sub-variables and dimensionalities:

• Independent relations, Ri = {Ji, Li}, where the position of stroke i does not depend on previous strokes. The variable Ji ∈ N is drawn from P(Ji), a multinomial over a 2D image grid that depends on index i (Figure 4c). Since the position Li ∈ R2 has to be real-valued, P(Li | Ji) is then sampled uniformly at random from within the image cell Ji.
• Start or End relations, Ri = {ui}, where stroke i starts at either the beginning or end of a previous stroke ui, sampled uniformly at random from ui ∈ {1, ..., i − 1}.
• Along relations, Ri = {ui, vi, τi}, where stroke i begins along previous stroke ui ∈ {1, ..., i − 1} at sub-stroke vi ∈ {1, ..., nui} at type-level spline coordinate τi ∈ R, each sampled uniformly at random.

2.2 Generating a character token

The token-level variables, θ(m) = {L(m), x(m), y(m), R(m), A(m), σ(m)b, ε(m)}, are distributed as

P(θ(m) | ψ) = P(L(m) | θ(m)\L(m), ψ) P(A(m), σ(m)b, ε(m)) ∏_i P(R(m)i | Ri) P(y(m)i | yi) P(x(m)i | xi)   (3)

with details below. As before, Sections 2.3 and SI-4 describe how the hyperparameters were learned.

Pen trajectories. A stroke trajectory T(m)i (Figure 3) is a sequence of points in the image plane that represents the path of the pen. Each trajectory T(m)i = f(L(m)i, x(m)i, y(m)i) is a deterministic function of a starting location L(m)i ∈ R2, token-level control points x(m)i ∈ R10, and token-level scale y(m)i ∈ R.
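The type-level stroke model of Section 2.1 (a first-order Markov chain over primitive ids with a stop state, Gaussian control points, and a Gamma-distributed scale) can be sketched in Python. The transition table and the Gaussian/Gamma parameters below are invented toy values, standing in for the learned library of primitives.

```python
import random

STOP = -1  # special "stop" state that ends the stroke

# Toy first-order Markov transitions over a 3-primitive library
# (placeholders for the learned primitives and transition model).
TRANSITIONS = {
    None: [(0, 0.5), (1, 0.3), (2, 0.2)],            # P(z_i1): no stop yet
    0: [(0, 0.2), (1, 0.3), (2, 0.2), (STOP, 0.3)],  # P(z_ij | z_i(j-1))
    1: [(0, 0.3), (1, 0.1), (2, 0.2), (STOP, 0.4)],
    2: [(0, 0.2), (1, 0.2), (2, 0.1), (STOP, 0.5)],
}

def sample_sub_stroke_ids(rng):
    """Run the Markov chain P(z_i) until the stop state is sampled."""
    ids, prev = [], None
    while True:
        states, probs = zip(*TRANSITIONS[prev])
        z = rng.choices(states, weights=probs, k=1)[0]
        if z == STOP:
            return ids
        ids.append(z)
        prev = z

def sample_stroke(rng, mu=0.0, sigma=1.0, alpha=2.0, beta=2.0):
    """Sample a stroke S_i = {(z_ij, x_ij, y_ij)}: a primitive id, five 2D
    control points (flattened into R^10) from a Gaussian, and a
    Gamma-distributed type-level scale."""
    stroke = []
    for z in sample_sub_stroke_ids(rng):
        x = [rng.gauss(mu, sigma) for _ in range(10)]  # x_ij in R^10
        y = rng.gammavariate(alpha, 1.0 / beta)        # y_ij ~ Gamma(alpha, beta)
        stroke.append({"z": z, "x": x, "y": y})
    return stroke
```

In the real model each primitive id z indexes its own Gaussian mean/covariance and Gamma parameters; here a single shared set is used for brevity.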
The control points and scale are noisy versions of their type-level counterparts, P(x(m)ij | xij) = N(xij, σx²I) and P(y(m)ij | yij) ∝ N(yij, σy²), where the scale is truncated below 0. To construct the trajectory T(m)i (see illustration in Figure 3), the spline defined by the scaled control points y(m)i1 x(m)i1 ∈ R10 is evaluated to form a trajectory,[1] which is shifted in the image plane to begin at L(m)i. Next, the second spline y(m)i2 x(m)i2 is evaluated and placed to begin at the end of the previous sub-stroke's trajectory, and so on until all sub-strokes are placed.

Token-level relations must be exactly equal to their type-level counterparts, P(R(m)i | Ri) = δ(R(m)i − Ri), except for the "along" relation, which allows for token-level variability in the attachment along the spline using a truncated Gaussian P(τ(m)i | τi) ∝ N(τi, στ²). Given the pen trajectories of the previous strokes, the start position L(m)i is sampled from P(L(m)i | R(m)i, T(m)1, ..., T(m)i−1) = N(g(R(m)i, T(m)1, ..., T(m)i−1), ΣL), where g(·) = Li when R(m)i is independent (Section 2.1), g(·) = start(T(m)ui) or g(·) = end(T(m)ui) when R(m)i is start or end, and g(·) is the proper spline evaluation when R(m)i is along.

[1] The number of spline evaluations is computed to be approximately 2 points for every 3 pixels of distance along the spline (with a minimum of 10 evaluations).

Figure 4: Learned hyperparameters. a) A subset of primitives, where the top row shows the most common ones. The first control point (circle) is filled. b & c) Empirical distributions, where the heatmaps in c) show how the starting point differs by stroke number.

Image.
An image transformation A(m) ∈ R4 is sampled from P(A(m)) = N([1, 1, 0, 0], ΣA), where the first two elements control a global re-scaling and the second two control a global translation of the center of mass of T(m). The transformed trajectories can then be rendered as a 105x105 grayscale image, using an ink model adapted from [10] (see Section SI-2). This grayscale image is then perturbed by two noise processes, which make the gradient more robust during optimization and encourage partial solutions during classification. These processes include convolution with a Gaussian filter with standard deviation σ(m)b and pixel flipping with probability ε(m), where the amounts of noise σ(m)b and ε(m) are drawn uniformly on a pre-specified range (Section SI-2). The grayscale pixels then parameterize 105x105 independent Bernoulli distributions, completing the full model of binary images P(I(m) | θ(m)) = P(I(m) | T(m), A(m), σ(m)b, ε(m)).

2.3 Learning high-level knowledge of motor programs

The Omniglot dataset was randomly split into a 30-alphabet "background" set and a 20-alphabet "evaluation" set, constrained such that the background set included the six most common alphabets as determined by Google hits. Background images, paired with their motor data, were used to learn the hyperparameters of the HBPL model, including a set of 1000 primitive motor elements (Figure 4a) and position models for a drawing's first, second, and third stroke, etc. (Figure 4c). Wherever possible, cross-validation (within the background set) was used to decide issues of model complexity within the conditional probability distributions of HBPL.
Details are provided in Section SI-4 for learning the models of primitives, positions, relations, token variability, and image transformations.

2.4 Inference

Posterior inference in this model is very challenging, since parsing an image I(m) requires exploring a large combinatorial space of different numbers and types of strokes, relations, and sub-strokes. We developed an algorithm for finding K high-probability parses, ψ[1], θ(m)[1], ..., ψ[K], θ(m)[K], which are the most promising candidates proposed by a fast, bottom-up image analysis, shown in Figure 5a and detailed in Section SI-5. These parses approximate the posterior with a discrete distribution,

P(ψ, θ(m) | I(m)) ≈ ∑_{i=1}^{K} wi δ(θ(m) − θ(m)[i]) δ(ψ − ψ[i]),   (4)

where each weight wi is proportional to the parse score, marginalizing over shape variables x,

wi ∝ w̃i = P(ψ[i]\x, θ(m)[i], I(m)),   (5)

and constrained such that ∑_i wi = 1. Rather than using just a point estimate for each parse, the approximation can be improved by incorporating some of the local variance around the parse. The token-level variables θ(m), which closely track the image, allow for little variability, and it is inexpensive to draw conditional samples from the type-level P(ψ | θ(m)[i], I(m)) = P(ψ | θ(m)[i]), as it does not require evaluating the likelihood of the image. Thus, just the local variance around the type-level is estimated with the token-level fixed.
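Converting the K unnormalized log parse scores log w̃i from Eq. 5 into normalized weights wi is a standard log-sum-exp normalization; a minimal sketch, with made-up scores loosely in the range shown in Figure 5b:

```python
import math

def normalize_parse_weights(log_scores):
    """Convert unnormalized parse log scores log w~_i (Eq. 5) into weights
    w_i that sum to 1, using the log-sum-exp trick for numerical stability."""
    m = max(log_scores)
    exp_scores = [math.exp(s - m) for s in log_scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

# Five illustrative parse scores (values invented for the example)
weights = normalize_parse_weights([-59.6, -88.9, -127.0, -159.0, -168.0])
assert abs(sum(weights) - 1.0) < 1e-9
assert weights[0] == max(weights)  # the best-scoring parse dominates
```

Subtracting the maximum score before exponentiating avoids underflow, which matters here because parse log scores differ by tens of nats.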
Metropolis–Hastings is run to produce N samples (Section SI-5.5) for each parse θ(m)[i], denoted by ψ[i1], ..., ψ[iN], where the improved approximation is

P(ψ, θ(m) | I(m)) ≈ Q(ψ, θ(m), I(m)) = ∑_{i=1}^{K} wi δ(θ(m) − θ(m)[i]) (1/N) ∑_{j=1}^{N} δ(ψ − ψ[ij]).   (6)

Figure 5: Parsing a raw image. a) The raw image (i) is processed by a thinning algorithm [18] (ii) and then analyzed as an undirected graph [20] (iii), where parses are guided random walks (Section SI-5). b) The five best parses found for that image (top row) are shown with their log wi (Eq. 5), where numbers inside circles denote stroke order and starting position, and smaller open circles denote sub-stroke breaks. These five parses were re-fit to three different raw images of characters (left in image triplets), where the best parse (top right) and its associated image reconstruction (bottom right) are shown above its score (Eq. 9).

Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables (bottom of Figure 5b), as explained in Section 3.1 on inference for one-shot classification.

3 Results

3.1 One-shot classification

People, HBPL, and several alternative models were evaluated on a set of 10 challenging one-shot classification tasks. The tasks tested within-alphabet classification on 10 alphabets, with examples in Figure 2 and detailed in Section SI-6. Each trial (of 400 total) consists of a single test image of a new character compared to 20 new characters from the same alphabet, given just one image each produced by a typical drawer of that alphabet.
Figure 1b shows two example trials.

People. Forty participants in the USA were tested on one-shot classification using Mechanical Turk. On each trial, as in Figure 1b, participants were shown an image of a new character and asked to click on another image that shows the same character. To ensure classification was indeed "one shot," participants completed just one randomly selected trial from each of the 10 within-alphabet classification tasks, so that characters never repeated across trials. There was also an instructions quiz, two practice trials with the Latin and Greek alphabets, and feedback after every trial.

Hierarchical Bayesian Program Learning. For a test image I(T) and 20 training images I(c) for c = 1, ..., 20, we use a Bayesian classification rule for which we compute an approximate solution

argmax_c log P(I(T) | I(c)).   (7)

Intuitively, the approximation uses the HBPL search algorithm to get K = 5 parses of I(c), runs K MCMC chains to estimate the local type-level variability around each parse, and then runs K gradient-based searches to re-optimize the token-level variables θ(T) (all are continuous) to fit the test image I(T). The approximation can be written as (see Section SI-7 for derivation)

log P(I(T) | I(c)) ≈ log ∫ P(I(T) | θ(T)) P(θ(T) | ψ) Q(θ(c), ψ, I(c)) dψ dθ(c) dθ(T)   (8)
≈ log ∑_{i=1}^{K} wi max_{θ(T)} P(I(T) | θ(T)) (1/N) ∑_{j=1}^{N} P(θ(T) | ψ[ij]),   (9)

where Q(·, ·, ·) and wi are from Eq. 6. Figure 5b shows examples of this classification score. While inference so far involves parses of I(c) re-fit to I(T), it also seems desirable to include parses of I(T) re-fit to I(c), namely P(I(c) | I(T)). We can re-write our classification rule (Eq.
10 center), and then to include both terms (Eq. 10 right), which is the rule we use,

argmax_c log P(I(T) | I(c)) = argmax_c log [P(I(c) | I(T)) / P(I(c))] = argmax_c log [P(I(c) | I(T)) / P(I(c)) · P(I(T) | I(c))],   (10)

where P(I(c)) ≈ ∑_i w̃i from Eq. 5. These three rules are equivalent if inference is exact, but due to our approximation, the two-way rule performs better as judged by pilot results.

Affine model. The full HBPL model is compared to a transformation-based approach that models the variance in image tokens as just global scales, translations, and blur, which relates to congealing models [23]. This HBPL model "without strokes" still benefits from good bottom-up image analysis (Figure 5) and a learned transformation model. The Affine model is identical to HBPL during search, but during classification, only the warp A(m), blur σ(m)b, and noise ε(m) are re-optimized to a new image (change the argument of "max" in Eq. 9 from θ(T) to {A(T), σ(T)b, ε(T)}).

Deep Boltzmann Machines (DBMs). A Deep Boltzmann Machine, with three hidden layers of 1000 hidden units each, was generatively pre-trained on an enhanced background set using the approximate learning algorithm from [25]. To evaluate classification performance, first the approximate posterior distribution over the DBM's top-level features was inferred for each image in the evaluation set, followed by performing 1-nearest neighbor in this feature space using cosine similarity. To speed up learning of the DBM and HD models, the original images were down-sampled, so that each image was represented by 28x28 pixels with grayscale values in [0, 1]. To further reduce overfitting and learn more about the 2D image topology, which is built in to some deep models like convolutional networks [19], the set of background characters was artificially enhanced by generating slight image translations (+/− 3 pixels), rotations (+/− 5 degrees), and scales (0.9 to 1.1).

Hierarchical Deep Model (HD).
A more elaborate Hierarchical Deep model is derived by composing hierarchical nonparametric Bayesian models with Deep Boltzmann Machines [26]. The HD model learns a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine, which allows one to represent both a layered hierarchy of increasingly abstract features and a tree-structured hierarchy of super-classes for sharing abstract knowledge among related classes. Given a new test image, the approximate posterior over class assignments can be quickly inferred, as detailed in [26].

Simple Strokes (SS). A much simpler variant of HBPL that infers rigid "stroke-like" parts [16].

Nearest neighbor (NN). Raw images are directly compared using cosine similarity and 1-NN.

Results. Performance is summarized in Table 1. As predicted, people were skilled one-shot learners, with an average error rate of 4.5%. HBPL achieved a similar error rate of 4.8%, which was significantly better than the alternatives. The Affine model achieved an error rate of 18.2% with the classification rule in Eq. 10 left, while performance was 31.8% error with Eq. 10 right. The deep learning models performed at 34.8% and 38% error, although performance was much worse without pre-training (68.3% and 72%). The Simple Strokes and Nearest Neighbor models had the highest error rates.

Table 1: One-shot classifiers

Learner   Error rate
Humans    4.5%
HBPL      4.8%
Affine    18.2% (31.8%)
HD        34.8% (68.3%)
DBM       38% (72%)
SS        62.5%
NN        78.3%

3.2 One-shot generation of new examples

Not only can people classify new examples, they can generate new examples – even from just one image. While all generative classifiers can produce examples, it can be difficult to synthesize a range of compelling new examples in their raw form, especially since many models generate only features of raw stimuli (e.g., [5]).
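For reference, the NN baseline in Table 1 amounts to 1-nearest-neighbor classification under cosine similarity on raw pixel vectors, with one training image per class; a minimal self-contained sketch (toy 4-pixel "images" stand in for real 105x105 inputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened image vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def one_nearest_neighbor(test_image, train_images):
    """Return the index (class) of the most cosine-similar training image,
    where each class is represented by exactly one example."""
    sims = [cosine_similarity(test_image, t) for t in train_images]
    return max(range(len(sims)), key=lambda c: sims[c])

# Toy 4-pixel "images": the test image matches class 1 exactly.
train = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
assert one_nearest_neighbor([0.0, 1.0, 1.0, 0.0], train) == 1
```

The same 1-NN rule, applied to DBM top-level features instead of raw pixels, gives the DBM classifier described above.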
While DBMs [25] can generate realistic digits after training on thousands of examples, how well do these and other models perform from just a single training image?

We ran another Mechanical Turk task to produce nine new examples of 50 randomly selected handwritten character images from the evaluation set. Three of these images are shown in the leftmost column of Figure 6. After correctly answering comprehension questions, 18 participants in the USA were asked to "draw a new example" of 25 characters, resulting in nine examples per character. To simulate drawings from nine different people, each of the models generated nine samples after seeing exactly the same images people did, as described in Section SI-8 and shown in Figure 6. Low-level image differences were minimized by re-rendering stroke trajectories in the same way for the models and people. Since the HD model does not always produce well-articulated strokes, it was not quantitatively analyzed, although there are clear qualitative differences between these and the human-produced images (Figure 6).

Figure 6: Generating new examples from just a single "target" image (left). Each grid shows nine new examples synthesized by people and the three computational models.

Visual Turing test. To compare the examples generated by people and the models, we ran a visual Turing test using 50 new participants in the USA on Mechanical Turk. Participants were told that they would see a target image and two grids of 9 images (Figure 6), where one grid was drawn by people with their computer mice and the other grid was drawn by a computer program that "simulates how people draw a new character." Which grid is which? There were two conditions, where the "computer program" was either HBPL or the Affine model. Participants were quizzed on their comprehension and then they saw 50 trials. Accuracy was revealed after each block of 10 trials.
Also, a button to review the instructions was always accessible. Four participants who reported technical difficulties were not analyzed.

Results. Participants who tried to label drawings from people vs. HBPL were only 56% correct, while those who tried to label people vs. the Affine model were 92% correct. A 2-way Analysis of Variance showed a significant effect of condition (p < .001), but no significant effect of block and no interaction. While both group means were significantly better than chance, a subject analysis revealed that only 2 of 21 participants were better than chance for people vs. HBPL, while 24 of 25 were significant for people vs. Affine. Likewise, 8 of 50 items were above chance for people vs. HBPL, while 48 of 50 items were above chance for people vs. Affine. Since participants could easily detect the overly consistent Affine model, it seems the difficulty participants had in detecting HBPL's exemplars was not due to task confusion. Interestingly, participants did not significantly improve over the trials, even after seeing hundreds of images from the model. Our results suggest that HBPL can generate compelling new examples that fool a majority of participants.

4 Discussion

Hierarchical Bayesian Program Learning (HBPL), by exploiting compositionality and causality, departs from standard models that need a lot more data to learn new concepts. From just one example, HBPL can both classify and generate compelling new examples, fooling judges in a "visual Turing test" that other approaches could not pass. Beyond the differences in model architecture, HBPL was also trained on the causal dynamics behind images, although just the images were available at evaluation time. If one were to incorporate this compositional and causal structure into a deep learning model, it could lead to better performance on our tasks.
Thus, we do not see our model as the final word on how humans learn concepts, but rather as a suggestion for the type of structure that best captures how people learn rich concepts from very sparse data. Future directions will extend this approach to other natural forms of generalization with characters, as well as to speech, gesture, and other domains where compositionality and causality are central.

Acknowledgments
We would like to thank MIT CoCoSci for helpful feedback. This work was supported by ARO MURI contract W911NF-08-1-0242 and an NSF Graduate Research Fellowship held by the first author.

References
[1] M. K. Babcock and J. Freyd. Perception of dynamic information in static handwritten forms. American Journal of Psychology, 101(1):111–130, 1988.
[2] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2):115–147, 1987.
[3] S. Carey and E. Bartlett. Acquiring a single new word. Papers and Reports on Child Language Development, 15:17–29, 1978.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[5] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
[6] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41:145–170, 1997.
[7] J. Freyd. Representing the dynamics of a static form. Memory and Cognition, 11(4):342–346, 1983.
[8] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1–58, 1992.
[9] E. Gilet, J. Diard, and P. Bessière.
Bayesian action-perception computational model: interaction of production and recognition of cursive letters. PLoS ONE, 6(6), 2011.
[10] G. E. Hinton and V. Nair. Inferring motor programs from images of handwritten digits. In Advances in Neural Information Processing Systems 19, 2006.
[11] K. H. James and I. Gauthier. Letter processing automatically recruits a sensory-motor brain network. Neuropsychologia, 44(14):2937–2949, 2006.
[12] K. H. James and I. Gauthier. When writing impairs reading: letter perception’s susceptibility to motor interference. Journal of Experimental Psychology: General, 138(3):416–431, 2009.
[13] C. Kemp and A. Jern. Abstraction and relational learning. In Advances in Neural Information Processing Systems 22, 2009.
[14] A. Krizhevsky. Learning multiple layers of features from tiny images. PhD thesis, University of Toronto, 2009.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 2012.
[16] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, 2011.
[17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Concept learning as motor program induction: a large-scale empirical study. In Proceedings of the 34th Annual Conference of the Cognitive Science Society, 2012.
[18] L. Lam, S.-W. Lee, and C. Y. Suen. Thinning methodologies: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9):869–885, 1992.
[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323, 1998.
[20] K. Liu, Y. S. Huang, and C. Y. Suen.
Identification of fork points on the skeletons of handwritten Chinese characters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10):1095–1100, 1999.
[21] M. Longcamp, J. L. Anton, M. Roth, and J. L. Velay. Visual presentation of single letters activates a premotor area involved in writing. NeuroImage, 19(4):1492–1500, 2003.
[22] E. M. Markman. Categorization and Naming in Children. MIT Press, Cambridge, MA, 1989.
[23] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from one example through shared densities on transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[24] M. Revow, C. K. I. Williams, and G. E. Hinton. Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):592–606, 1996.
[25] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[26] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. Learning with hierarchical-deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1958–1971, 2013.
[27] P. H. Winston. Learning structural descriptions from examples. In P. H. Winston, editor, The Psychology of Computer Vision. McGraw-Hill, New York, 1975.
[28] F. Xu and J. B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2):245–272, 2007.