Advances in Neural Information Processing Systems

Emergence of Object-Selective Features in Unsupervised Feature Learning

Adam Coates, Andrej Karpathy, Andrew Y. Ng
Computer Science Department
Stanford University
Stanford, CA 94305
{acoates,karpathy,ang}@cs.stanford.edu

Abstract

Recent work in unsupervised feature learning has focused on the goal of discovering high-level features from unlabeled images.
Much progress has been made in this direction, but in most cases it is still standard to use a large amount of labeled data in order to construct detectors sensitive to object classes or other complex patterns in the data. In this paper, we aim to test the hypothesis that unsupervised feature learning methods, provided with only unlabeled data, can learn high-level, invariant features that are sensitive to commonly-occurring objects. Though a handful of prior results suggest that this is possible when each object class accounts for a large fraction of the data (as in many labeled datasets), it is unclear whether something similar can be accomplished when dealing with completely unlabeled data. A major obstacle to this test, however, is scale: we cannot expect to succeed with small datasets or with small numbers of learned features. Here, we propose a large-scale feature learning system that enables us to carry out this experiment, learning 150,000 features from tens of millions of unlabeled images. Based on two scalable clustering algorithms (K-means and agglomerative clustering), we find that our simple system can discover features sensitive to a commonly occurring object class (human faces) and can also combine these into detectors invariant to significant global distortions like large translations and scale.

1 Introduction

Many algorithms are now available to learn hierarchical features from unlabeled image data. There is some evidence that these algorithms are able to learn useful high-level features without labels, yet in practice it is still common to train such features from labeled datasets (but ignoring the labels), and to ultimately use a supervised learning algorithm to learn to detect more complex patterns that the unsupervised learning algorithm is unable to find on its own.
Thus, an interesting open question is whether unsupervised feature learning algorithms are able to construct features, without the benefit of supervision, that can identify high-level concepts like frequently-occurring object classes. It is already known that this can be achieved when the dataset is sufficiently restricted that object classes are clearly defined (typically closely cropped images) and occur very frequently [13, 21, 22]. In this work we aim to test whether unsupervised learning algorithms can achieve a similar result without any supervision at all.

The setting we consider is a challenging one. We have harvested a dataset of 1.4 million image thumbnails from YouTube and extracted roughly 57 million 32-by-32 pixel patches at random locations and scales. These patches are very different from those found in labeled datasets like CIFAR-10 [9]. The overwhelming majority of patches in our dataset appear to be random clutter. In the cases where such a patch contains an identifiable object, it may well be scaled, arbitrarily cropped, or uncentered. As a result, it is very unclear where an "object class" begins or ends in this type of patch dataset, and less clear that a completely unsupervised learning algorithm could manage to create "object-selective" features able to distinguish an object from the wide variety of clutter without some other type of supervision.

In order to have some hope of success, we can identify several key properties that our learning algorithm should likely have. First, since identifiable objects show up very rarely, it is clear that we are obliged to train from extremely large datasets. We have no way of controlling how often a particular object shows up, and thus enough data must be used to ensure that an object class is seen many times, often enough that it cannot be disregarded as random clutter.
Second, we are also likely to need a very large number of features. Training too few features will cause us to "under-fit" the distribution, forcing the learning algorithm to ignore rare events like objects. Finally, as is already common in feature learning work, we should aim to build features that incorporate invariance, so that features respond not just to a specific pattern (e.g., an object at a single location and scale), but to a range of patterns that collectively belong to the same object class (e.g., the same object seen at many locations and scales). Unfortunately, these desiderata are difficult to achieve at once: current methods for building invariant hierarchies of features are difficult to scale up to train many thousands of features from our 57 million patch dataset on our cluster of 30 machines.

In this paper, we will propose a highly scalable combination of clustering algorithms for learning selective and invariant features that is capable of tackling this size of problem. Surprisingly, we find that despite the simplicity of these algorithms we are nevertheless able to discover high-level features sensitive to the most commonly occurring object class present in our dataset: human faces. In fact, we find that these features are better face detectors than a linear filter trained from labeled data, achieving up to 86% AUC compared to 77% on labeled validation data. Thus, our results emphasize not only that unsupervised learning algorithms can discover object-selective features with no labeled data, but that such features can potentially perform better than basic supervised detectors due to their deep architecture.
Though our approach is based on fast clustering algorithms (K-means and agglomerative clustering), its basic behavior is essentially similar to existing methods for building invariant feature hierarchies, suggesting that other popular feature learning methods currently available may also be able to achieve such results if run at large enough scale. Indeed, recent work with a more sophisticated (but vastly more expensive) feature-learning algorithm appears to achieve similar results [11] when presented with full-frame images.

We will begin with a description of our algorithms for learning selective and invariant features, and explain their relationship to existing systems. We will then move on to presenting our experimental results. Results and methods related to our own will be reviewed briefly before concluding.

2 Algorithm

Our system is built on two separate learning modules: (i) an algorithm to learn selective features (linear filters that respond to a specific input pattern), and (ii) an algorithm to combine the selective features into invariant features (that respond to a spectrum of gradually changing patterns). We will refer to these features as "simple cells" and "complex cells" respectively, in analogy to previous work and to biological cells with (very loosely) related response properties. Following other popular systems [14, 12, 6, 5] we will then use these two algorithms to build alternating layers of simple cell and complex cell features.

2.1 Learning Selective Features (Simple Cells)

The first module in our learning system trains a bank of linear filters to represent our selective "simple cell" features. For this purpose we use the K-means-like method of [2], which has previously been used for large-scale feature learning.

The algorithm is given a set of input vectors x^(i) ∈ ℝ^n, i = 1, . . . , m.
These vectors are preprocessed by removing the mean and normalizing each example, then performing PCA whitening. We then learn a dictionary D ∈ ℝ^(n×d) of linear filters as in [2] by alternating optimization over filters D and "cluster assignments" C:

    minimize_{D,C}  Σ_i ||D C^(i) − x^(i)||²₂
    subject to      ||D^(j)||₂ = 1, ∀j,
    and             ||C^(i)||₀ ≤ 1, ∀i.

Here the constraint ||C^(i)||₀ ≤ 1 means that the vectors C^(i), i = 1, . . . , m are allowed to contain only a single non-zero entry, but the non-zero value is otherwise unconstrained. Given the linear filters D, we then define the responses of the learned simple cell features as s^(i) = g(a^(i)) where a^(i) = D⊤x^(i) and g(·) is a nonlinear activation function. In our experiments we will typically use g(a) = |a| for the first layer of simple cells, and g(a) = a for the second.¹

2.2 Learning Invariant Features (Complex Cells)

To construct invariant complex cell features a common approach is to create "pooling units" that combine the responses of lower-level simple cells. In this work, we use max-pooling units [14, 13]. Specifically, given a vector of simple cell responses s^(i), we will train complex cell features whose responses are given by:

    c^(i)_j = max_{k ∈ G_j} s^(i)_k

where G_j is a set that specifies which simple cells the j'th complex cell should pool over.
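As a concrete illustration, the alternating optimization above can be sketched in a few lines of NumPy. This is a minimal "gain-shape" K-means: the assignment step picks the single best-matching filter per example (the one non-zero of C^(i)), and the update step re-estimates each filter from the examples assigned to it. The function name and initialization are illustrative, and the preprocessing (mean removal, normalization, PCA whitening) is assumed to have been applied to X already.

```python
import numpy as np

def learn_simple_cells(X, d, iters=10, seed=0):
    """Gain-shape K-means sketch. X: (m, n) preprocessed examples.
    Returns a dictionary D of shape (n, d) with unit-norm columns."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((n, d))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(iters):
        A = X @ D                        # projections a = D^T x, shape (m, d)
        k = np.abs(A).argmax(axis=1)     # assignment: best filter per example
        v = A[np.arange(m), k]           # its unconstrained coefficient
        # Update: each filter accumulates v * x over its assigned examples.
        Dnew = np.zeros_like(D)
        np.add.at(Dnew.T, k, v[:, None] * X)
        norms = np.linalg.norm(Dnew, axis=0)
        nz = norms > 1e-8                # leave unused ("dead") filters as-is
        D[:, nz] = Dnew[:, nz] / norms[nz]
    return D
```

Each K-means iteration is a single pass of matrix multiplication and scatter-adds, which is what makes the approach so much cheaper per step than gradient-based feature learners.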
Thus, the complex cell c_j is an invariant feature that responds significantly to any of the patterns represented by simple cells in its group.

Each group G_j should specify a set of simple cells that are, in some sense, similar to one another. In convolutional neural networks [12], for instance, each group is hard-coded to include translated copies of the same filter, resulting in complex cell responses c_j that are invariant to small translations. Some algorithms [6, 3] fix the groups G_j ahead of time, then optimize the simple cell filters D so that the simple cells in each group share a particular form of statistical dependence. In our system, we will use linear correlation of simple cell responses, E[a_k a_l], as our similarity metric, and construct groups G_j that combine similar features according to this metric. Computing the similarity directly would normally require us to estimate the correlations from data, but since the inputs x^(i) are whitened we can instead compute the similarity directly from the filter weights:

    E[a_k a_l] = E[D^(k)⊤ x^(i) x^(i)⊤ D^(l)] = D^(k)⊤ D^(l).

For convenience in the following, we will actually use the dissimilarity between features, defined as

    d(k, l) = ||D^(k) − D^(l)||₂ = √(2 − 2 E[a_k a_l]).

To construct the groups G, we will use a version of single-link agglomerative clustering to combine sets of features that have low dissimilarity according to d(k, l).² To construct a single group G₀ we begin by choosing a random simple cell filter, say D^(k), as the first member. We then search for candidate cells to be added to the group by computing d(k, l) for each simple cell filter D^(l), and add D^(l) to the group if d(k, l) is less than some limit τ. The algorithm then continues to expand G₀ by adding any additional simple cells that are closer than τ to any one of the simple cells already in the group.
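A minimal sketch of this group-growing step, including the diameter limit Δ that (as discussed below) caps how far the group can spread. The function name and loop structure are illustrative; the dissimilarities are computed from the unit-norm filter weights exactly as in the formula above.

```python
import numpy as np

def grow_group(D, seed_idx, tau, delta):
    """Grow one pooling group by single-link expansion (sketch).
    D: (n, d) dictionary with unit-norm columns. A filter joins the group
    if it is within tau of ANY current member, provided the group diameter
    (largest pairwise dissimilarity) stays at most delta."""
    d_feats = D.shape[1]
    # d(k, l) = sqrt(2 - 2 * D_k . D_l), valid because inputs are whitened.
    gram = D.T @ D
    dist = np.sqrt(np.maximum(2.0 - 2.0 * gram, 0.0))
    group = [seed_idx]
    grew = True
    while grew:
        grew = False
        for l in range(d_feats):
            if l in group:
                continue
            # Single-link criterion: close to at least one member...
            if dist[group, l].min() < tau:
                # ...without pushing the diameter past delta.
                if dist[group, l].max() <= delta:
                    group.append(l)
                    grew = True
    return group
```

Because each group depends only on the precomputed distance matrix, many seeds can be grown in parallel, which matches the cheap, embarrassingly parallel behavior described in the text.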
This procedure continues until there are no more cells to be added, or until the diameter of the group (the dissimilarity between the two furthest cells in the group) reaches a limit Δ.³

This procedure can be executed, quite rapidly, in parallel for a large number of randomly chosen simple cells acting as the "seed" cell, thus allowing us to train many complex cells at once. Compared to the simple cell learning procedure, the computational cost is extremely small even for our rudimentary implementation. In practice, we often generate many groups (e.g., several thousand) and then keep only a random subset of the largest groups. This ensures that we do not end up with many groups that pool over very few simple cells (and hence yield complex cells c_j that are not especially invariant).

¹ This allows us to train roughly half as many simple cell features for the first layer.
² Since the first layer uses g(a) = |a|, we actually use d(k, l) = min{||D^(k) − D^(l)||₂, ||D^(k) + D^(l)||₂} to account for −D^(l) and +D^(l) being essentially the same feature.
³ We use τ = 0.3 for the first layer of complex cells and τ = 1.0 for the second layer. These were chosen by examining the typical distance between a filter D^(k) and its nearest neighbor. We use Δ = 1.5 > √2 so that a complex cell group may include orthogonal filters but cannot grow without limit.

2.3 Algorithm Behavior

Though it seems plausible that pooling simple cells with similar-looking filters according to d(k, l) as above should give us some form of invariant feature, it may not yet be clear why this form of invariance is desirable. To explain, we will consider a simple "toy" data distribution where the behavior of these algorithms is more clear.
Specifically, we will generate three heavy-tailed random variables X, Y, Z according to:

    σ1, σ2 ∼ L(0, λ)
    e1, e2, e3 ∼ N(0, 1)
    X = e1σ1,  Y = e2σ1,  Z = e3σ2

Here, σ1, σ2 are scale parameters sampled independently from a Laplace distribution, and e1, e2, e3 are sampled independently from a unit Gaussian. The result is that Z is independent of both X and Y, but X and Y are not independent due to their shared scale parameter σ1 [6]. An isocontour of the density of this distribution is shown in Figure 1a.

Other popular algorithms [6, 5, 3] for learning complex-cell features are designed to identify X and Y as features to be pooled together due to the correlation in their energies (scales). One empirical motivation for this kind of invariance comes from natural images: if we have three simple-cell filter responses a1 = D^(1)⊤x, a2 = D^(2)⊤x, a3 = D^(3)⊤x where D^(1) and D^(2) are Gabor filters in quadrature phase, but D^(3) is a Gabor filter at a different orientation, then the responses a1, a2, a3 will tend to have a distribution very similar to the model of X, Y, Z above [7]. By pooling together the responses of a1 and a2, a complex cell is able to detect an edge of fixed orientation invariant to small translations. This model also makes sense for higher-level invariances, where X and Y do not merely represent responses of linear filters on image patches but feature responses in a deep network. Indeed, the X–Y plane in Figure 1a is referred to as an "invariant subspace" [8].

Our combination of simple cell and complex cell learning algorithms tends to learn this same type of invariance. After whitening and normalization, the data points X, Y, Z drawn from the distribution above will lie (roughly) on a sphere.
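The toy model can be sampled directly; a small NumPy sketch follows. The Laplace scale λ is left unspecified in the text, so the default below is an arbitrary choice, and the function name is illustrative. The key property, that the energies of X and Y are correlated while Z's is not, can be checked empirically on the samples.

```python
import numpy as np

def sample_toy(m, lam=1.0, seed=0):
    """Sample the heavy-tailed toy model: X and Y share a Laplace scale
    sigma1 (so their energies are correlated); Z uses an independent
    scale sigma2."""
    rng = np.random.default_rng(seed)
    s1 = rng.laplace(0.0, lam, m)   # shared scale for X and Y
    s2 = rng.laplace(0.0, lam, m)   # independent scale for Z
    e = rng.standard_normal((3, m))
    return e[0] * s1, e[1] * s1, e[2] * s2
```

On a large sample, corr(X², Y²) comes out clearly positive while corr(X², Z²) is near zero, which is exactly the energy-correlation signal that subspace-pooling methods exploit.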
The density of these data points is pictured in Figure 1b, where it can be seen that the highest density areas are in a "belt" in the X–Y plane and at the poles along the Z axis, with a low-density region in between. Applying our K-means clustering method to this data results in the centroids shown as * marks in Figure 1b. From this picture it is clear what a subsequent application of our single-link clustering algorithm will do: it will string together the centroids around the "belt" that forms the invariant subspace and avoid connecting them to the (distant) centroids at the poles. Max-pooling over the responses of these filters will result in a complex cell that responds consistently to points in the X–Y plane, but not in the Z direction; that is, we end up with an invariant feature detector very similar to those constructed by existing methods. Figure 1c depicts this result, along with visualizations of the hypothetical Gabor filters D^(1), D^(2), D^(3) described above that might correspond to the learned centroids.

Figure 1: (a) An isocontour of a sparse probability distribution over variables X, Y, and Z. (See text for details.) (b) A visualization of the spherical density obtained from the distribution in (a) after normalization. Red areas are high density and dark blue areas are low density. Centroids learned by K-means from this data are shown on the surface of the sphere as * marks. (c) A pooling unit identified by applying single-link clustering to the centroids (black links join pooled filters). (See text.)

2.4 Feature Hierarchy

Now that we have defined our simple and complex cell learning algorithms, we can use them to train alternating layers of selective and invariant features. We will train 4 layers total, 2 of each type.
The architecture we use is pictured in Figure 2a.

Figure 2: (a) Cross-section of network architecture used for experiments. Full layer sizes are shown at right. (b) Randomly selected 128-by-96 images from our dataset.

Our first layer of simple cell features is locally connected to 16 non-overlapping 8-by-8 pixel patches within the 32-by-32 pixel image. These features are trained by building a dataset of 8-by-8 patches and passing them to our simple cell learning procedure to train 6400 first-layer filters D ∈ ℝ^(64×6400). We apply our complex cell learning procedure to this bank of filters to find 128 pooling groups G₁, G₂, . . . , G₁₂₈. Using these results, we can extract our simple cell and complex cell features from each 8-by-8 pixel subpatch of the 32-by-32 image. Specifically, the linear filters D are used to extract the first layer simple cell responses s^(p)_i = g(D^(i)⊤ x^(p)), where x^(p), p = 1, . . . , 16 are the 16 subpatches of the 32-by-32 image. We then compute the complex cell feature responses c^(p)_j = max_{k ∈ G_j} s^(p)_k for each patch.

Once complete, we have an array of 128-by-4-by-4 = 2048 complex cell responses c representing each 32-by-32 image. These responses are then used to form a new dataset from which to learn a second layer of simple cells with K-means. In our experiments we train 150,000 second layer simple cells. We denote the second layer of learned filters as D̄, and the second layer simple cell responses as s̄ = D̄⊤c. Applying our complex cell learning procedure again to D̄, we obtain pooling groups Ḡ and complex cells c̄, defined analogously.

3 Experiments

As described above, we ran our algorithm on patches harvested from YouTube thumbnails downloaded from the web.
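The four-layer forward pass just described can be summarized in a compact sketch. Shapes and names here are illustrative (small toy sizes rather than the 6400/150,000 filters of the real system), and the per-patch preprocessing is assumed to be applied upstream.

```python
import numpy as np

def forward(img32, D1, groups1, D2, groups2):
    """Sketch of the alternating simple/complex forward pass.
    img32: flattened 32x32 grayscale image. D1: (64, d1) layer-1 filters;
    groups1: list of index arrays (layer-1 pooling groups);
    D2: (16 * len(groups1), d2) layer-2 filters; groups2: analogous."""
    # Tile the image into 16 non-overlapping 8x8 patches, flattened to 64-d.
    patches = img32.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(16, 64)
    c1 = []
    for x in patches:
        s = np.abs(D1.T @ x)                       # layer-1 simple: g(a)=|a|
        c1.append([s[g].max() for g in groups1])   # layer-1 complex: max-pool
    c1 = np.asarray(c1).ravel()                    # 16 * len(groups1) values
    s2 = D2.T @ c1                                 # layer-2 simple: g(a)=a
    c2 = np.array([s2[g].max() for g in groups2])  # layer-2 complex
    return s2, c2
```

With the paper's sizes (128 groups over 16 patches), c1 would be the 2048-dimensional complex-cell representation of the 32-by-32 image from which the second layer is learned.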
Specifically, we downloaded the thumbnails for over 1.4 million YouTube videos⁴, some of which are shown in Figure 2b. These images were downsampled to 128-by-96 pixels and converted to grayscale. We cropped 57 million randomly selected 32-by-32 pixel patches from these images to form our unlabeled training set. No supervision was used; thus most patches contain partial views of objects or clutter at differing scales. We ran our algorithm on these images using a cluster of 30 machines over 3 days, with virtually all of the time spent training the 150,000 second-layer features.⁵ We will now visualize these features and check whether any of them have learned to identify an object class.

3.1 Low-Level Simple and Complex Cell Visualizations

We visualize the learned low-level filters D and pooling groups G to verify that they are, in fact, similar to those learned by other well-known algorithms. It is already known that our K-means-based algorithm learns simple-cell-like filters (e.g., edge-like features, as well as spots and curves), as shown in Figure 3a.

To visualize the learned complex cells we inspect the simple cell filters that belong to each of the pooling groups. The filters for several pooling groups are shown in Figure 3b. As expected, the filters cover a spectrum of similar image structures. Though many pairs of filters are extremely similar⁶,

⁴ We cannot select videos at random, so we query videos under each YouTube category ("Pets & Animals", "Science & Technology", etc.) along with a date (e.g., "January 2001").
⁵ Though this is a fairly long run, we note that 1 iteration of K-means is cheaper than a single batch gradient step for most other methods able to learn high-level invariant features.
We expect that these experiments would be impossible to perform in a reasonable amount of time on our cluster with another algorithm.

⁶ Some filters have reversed polarity due to our use of absolute-value rectification during training of the first layer.

there are also other pairs that differ significantly yet are included in the group due to the single-link clustering method. Note that some of our groups are composed of similar edges at differing locations, and thus appear to have learned translation invariance as expected.

3.2 Higher-Level Simple and Complex Cells

Finally, we inspect the learned higher layer simple cell and complex cell features, s̄ and c̄, particularly to see whether any of them are selective for an object class. The most commonly occurring object in these video thumbnails is human faces (even though we estimate that much less than 0.1% of patches contain a well-framed face). Thus we search through our learned features for cells that are selective for human faces at varying locations and scales. To locate such features we use a dataset of labeled images: several hundred thousand non-face images as well as tens of thousands of known face images from the "Labeled Faces in the Wild" (LFW) dataset [4].⁷

To test whether any of the s̄ simple cell features are selective for faces, we use each feature by itself as a "detector" on the labeled dataset: we compute the area under the precision-recall curve (AUC) obtained when each feature's response s̄ᵢ is used as a simple classifier. Indeed, it turns out that there are a handful of high-level features that tend to be good detectors for faces. The precision-recall curves for the best 5 detectors are shown in Figure 3c (top curves); the best of these achieves 86% AUC.
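This per-feature screening can be sketched as follows. Average precision is used here as the area under the precision-recall curve; this is a standard estimator, though the exact AUC computation used in the paper is not specified, so treat the details as an assumption.

```python
import numpy as np

def pr_auc(scores, labels):
    """Area under the precision-recall curve (average-precision form)
    when a single feature's response is used as a detector score.
    labels: 1 = face, 0 = non-face."""
    order = np.argsort(-scores)          # rank examples by response
    y = labels[order]
    tp = np.cumsum(y)                    # true positives at each rank
    precision = tp / np.arange(1, len(y) + 1)
    # Average the precision at the rank of each positive example.
    return float((precision * y).sum() / y.sum())
```

Screening the whole layer is then just a loop computing `pr_auc` for every column of the s̄ response matrix and keeping the top-scoring features.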
We visualize 16 of the simple cell features identified by this procedure⁸ in Figure 4a, along with a sampling of the image patches that activate the first of these cells strongly. There it can be seen that these simple cells are selective for faces located at particular locations and scales. Within each group the faces differ slightly due to the learned invariance provided by the complex cells in the lower layer (and thus the mean of each group of images is blurry).

Figure 3: (a) First layer simple cell filters learned by K-means. (b) Sets of simple cell filters belonging to three pooling groups learned by our complex cell training algorithm. (c) Precision-recall curves showing selectivity for human faces of 5 low-level simple cells trained from a full 32-by-32 patch (red curves, bottom) versus 5 higher-level simple cells (green curves, top). Performance of the best linear filter found by SVM from labeled data is also shown (black dotted curve, middle).

It may appear that this result could be obtained by applying our simple cell learning procedure directly to full 32-by-32 images without any attempt at incorporating local invariance. That is, rather than training D (the first-layer filters) from 8-by-8 patches, we could try to train D directly from the 32-by-32 images. This turns out not to be successful. The lower curves in Figure 3c are the precision-recall curves for the best 5 simple cells found in this way.
Clearly the higher-level features are dramatically better detectors than simple cells built directly from pixels⁹ (only 64% AUC).

⁷ Our positive face samples include the entire set of labeled faces, plus randomly scaled and translated copies.
⁸ We visualize the higher-level features by averaging together the 100 unlabeled images from our YouTube dataset that elicit the strongest activation.
⁹ These simple cells were trained by applying K-means to normalized, whitened 32-by-32 pixel patches from a smaller unlabeled set known to have a higher concentration of faces. Due to this, a handful of centroids look roughly like face exemplars and act as simple "template matchers". When trained on the full dataset (which contains far fewer faces), K-means learns only edge and arc features which perform much worse (about 45% AUC).

Table 1: Area under PR curve for different cells on our face detection validation set. Only the SVM uses labeled data.

    Best 32-by-32 simple cell   64%
    Best in s̄                   86%
    Best in c̄                   80%
    Supervised Linear SVM       77%

Figure 4: Visualizations. (a) A collection of patches from our unlabeled dataset that maximally activate one of the high-level simple cells from s̄. (b) The mean of the top stimuli for a handful of face-selective cells in s̄. (c) Visualization of the face-selective cells that belong to one of the complex cells in c̄ discovered by the single-link clustering algorithm applied to D̄. (d) A collection of unlabeled patches that elicit a strong response from the complex cell visualized in (c); virtually all are faces, at a variety of scales and positions. Compare to (a).

As a second control experiment we train a linear SVM from half of the labeled data using only pixels as input (contrast-normalized and whitened).
The PR curve for this linear classifier is shown in Figure 3c as a black dotted line. There we see that the supervised linear classifier is significantly better (77% AUC) than the 32-by-32 linear simple cells. On the other hand, it does not perform as well as the higher level simple cells learned by our system, even though it is likely the best possible linear detector.

Finally, we inspect the higher-level complex cells learned by applying the same agglomerative clustering procedure to the higher-level simple cell filters. Due to the invariance introduced at the lower layers, two simple cells that detect faces at slightly different locations or scales will often have very similar filter weights, and thus we expect our algorithm to find and combine these simple cells into higher-level invariant features.

To visualize our higher-level complex cell features c̄, we can simply look at visualizations for all of the simple cells in each of the groups Ḡ. These visualizations show us the set of patches that strongly activate each simple cell, and hence also activate the complex cell. The result of such a visualization for one group that was found to contain only face-selective cells is shown in Figure 4c. There it can be seen that this single "complex cell" selects for faces at multiple positions and scales. A sampling of image patches collected from the unlabeled data that strongly activate the corresponding complex cell is shown in Figure 4d. We see that the complex cell detects many faces, but at a much wider variety of positions and scales compared to the simple cells, demonstrating that even "higher level" invariances are being captured, including scale invariance. Benchmarked on our labeled set, this complex cell achieves 80.0% AUC, somewhat worse than the very best simple cells, but still in the top 10 performing cells in the entire network.
Interestingly, the qualitative results in Figure 4d are excellent, and we believe these images represent an even greater range of variations than those in the labeled set. Thus the 80% AUC number may somewhat under-rate the quality of these features.

These results suggest that the basic notions of invariance and selectivity that underpin popular feature learning algorithms may be sufficient to discover the kinds of high-level features that we desire, possibly including whole object classes robust to local and global variations. Indeed, using simple implementations of selective and invariant features closely related to existing algorithms, we have found that it is possible to build features with high selectivity for a coherent, commonly occurring object class. Though human faces occur only very rarely in our very large dataset, it is clear that the complex cell visualized in Figure 4d is adept at spotting them amongst tens of millions of images. The enabler for these results is the scalability of the algorithms we have employed, suggesting that other systems can likely achieve similar results to the ones shown here if their computational limitations are overcome.

4 Related Work

The method that we have proposed has close connections to a wide array of prior work. For instance, the basic notions of selectivity and invariance that drive our system can be identified in many other algorithms: group sparse coding methods [3] and Topographic ICA [6, 7] build invariances by pooling simple cells that lie in an invariant subspace, identified by strong scale correlations between cell responses. The advantage of this criterion is that it can determine which features to pool together even when the simple cell filters are orthogonal (where they would be too far apart for our algorithm to recognize their relationship).
Our results suggest that while this type of invariance is very useful, there exist simple ways of achieving a similar effect.

Our approach is also connected with methods that attempt to model the geometric (e.g., manifold) structure of the input space. For instance, Contractive Auto-Encoders [16, 15], Local Coordinate Coding [20], and Locality-constrained Linear Coding [19] learn sparse linear filters while attempting to model the manifold structure staked out by these filters (sometimes termed "anchor points"). One interpretation of our method, suggested by Figure 1b, is that with extremely overcomplete dictionaries it is possible to use trivial distance calculations to identify neighboring points on the manifold. This in turn allows us to construct features invariant to shifts along the manifold with little effort. Similar intuitions are used in [1] to propose a clustering method related to our approach.

One of our key results, the unsupervised discovery of features selective for human faces, is fairly unique (though seen recently in the extremely large system of [11]). Results of this kind have appeared previously in restricted settings. For instance, [13] trained Deep Belief Network models that decomposed object classes like faces and cars into parts, using a probabilistic max-pooling to gain translation invariance. Similarly, [21] has shown results of a similar flavor on the Caltech recognition datasets. [22] showed that a probabilistic model (with some hand-coded geometric knowledge) can recover clusters containing 20 known object class silhouettes from outlines in the LabelMe dataset. Other authors have shown the ability to discover detailed manifold structure (e.g., as seen in the results of embedding algorithms [18, 17]) when trained in similarly restricted settings. The structure that these methods discover, however, is far more apparent when we are using labeled, tightly cropped images.
Even if we do not use the labels themselves, the labeled examples are, by construction, highly clustered: faces will be separated from other objects because there are no partial faces or random clutter. In our dataset, by contrast, no supervision is used except to probe the representation post hoc.

Finally, we note the recent, extensive findings of Le et al. [11]. In that work, an extremely large 9-layer neural network based on a TICA-like learning algorithm [10, 6] is also capable of identifying a wide variety of object classes (including cats and upper bodies of people) seen in YouTube videos. Our results complement this work in several key ways. First, by training on smaller, randomly cropped patches, we show that object-selectivity may still be obtained even when objects are almost never framed properly within the image, ruling out this bias as the source of object-selectivity. Second, we have shown that the key concepts used in their system (sparse selective filters and invariant-subspace pooling) can also be implemented in a different way using scalable clustering algorithms, allowing us to achieve results reminiscent of theirs with a vastly smaller amount of computing power. (We used 240 cores, while their large-scale system is composed of 16,000 cores.)
In combination, these results point strongly to the conclusion that almost any highly scalable implementation of existing feature-learning concepts may be enough to discover these sophisticated high-level representations.

5 Conclusions

In this paper we have presented a feature learning system composed of two highly scalable but otherwise very simple learning algorithms: K-means clustering to find sparse linear filters ("simple cells") and agglomerative clustering to stitch simple cells together into invariant features ("complex cells"). We showed that these two components are, in fact, capable of learning complicated high-level representations in large-scale experiments on unlabeled images pulled from YouTube. Specifically, we found that higher-level simple cells could learn to detect human faces without any supervision at all, and that our complex-cell learning procedure combined these into even higher-level invariances. These results indicate that many of the key principles needed to achieve such results are already in hand, and that a critical remaining puzzle is how to scale up our algorithms to the sizes needed to capture more object classes and even more sophisticated invariances.

References

[1] Y. Boureau, N. L. Roux, F. Bach, J. Ponce, and Y. LeCun. Ask the locals: Multi-way local pooling for image recognition. In International Conference on Computer Vision, pages 2651–2658, 2011.

[2] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, pages 921–928, 2011.

[3] P. Garrigues and B. Olshausen. Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems 23, pages 676–684, 2010.

[4] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller.
Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[5] A. Hyvärinen and P. Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.

[6] A. Hyvärinen, P. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527–1558, 2001.

[7] A. Hyvärinen, J. Hurri, and P. Hoyer. Natural Image Statistics. Springer-Verlag, 2009.

[8] T. Kohonen. Emergence of invariant-feature detectors in self-organization. In M. Palaniswami et al., editors, Computational Intelligence, A Dynamic System Perspective, pages 17–31. IEEE Press, New York, 1995.

[9] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of Computer Science, University of Toronto, 2009.

[10] Q. Le, A. Karpenko, J. Ngiam, and A. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, 2011.

[11] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[12] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.

[13] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, pages 609–616, 2009.

[14] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1999.

[15] S. Rifai, Y. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, 2011.

[16] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning, 2011.

[17] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000.

[18] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.

[19] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Computer Vision and Pattern Recognition, pages 3360–3367, 2010.

[20] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems 22, pages 2223–2231, 2009.

[21] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In International Conference on Computer Vision, 2011.

[22] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille. Part and appearance sharing: Recursive compositional models for multi-view multi-object detection. In Computer Vision and Pattern Recognition, 2010.