{"title": "FloatBoost Learning for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1024, "abstract": null, "full_text": "FloatBoost Learning for Classification\n\nStan Z. Li\nMicrosoft Research Asia\nBeijing, China\n\nHeung-Yeung Shum\nMicrosoft Research Asia\nBeijing, China\n\nZhenQiu Zhang*\nInstitute of Automation\nCAS, Beijing, China\n\nHongJiang Zhang\nMicrosoft Research Asia\nBeijing, China\n\nAbstract\n\nAdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which is itself difficult, especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and testing. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the advantages of FloatBoost over AdaBoost.\n\n1 Introduction\n\nNonlinear classification of high dimensional data is a challenging problem. 
While designing such a classifier is difficult, the AdaBoost learning method, introduced by Freund and Schapire [3], provides an effective stagewise approach: it learns a sequence of more easily learnable \u201cweak classifiers\u201d and boosts them into a single strong classifier by a linear combination of them. It has been shown that AdaBoost learning minimizes an upper error bound which is an exponential function of the margin on the training set [14].\n\nBoosting learning originated from the PAC (probably approximately correct) learning theory [17, 6]. Given that weak classifiers can perform slightly better than random guessing on every distribution over the training set, AdaBoost can provably achieve arbitrarily good bounds on its training and generalization errors [3, 15]. It has been shown that such simple weak classifiers, when boosted, can capture complex decision boundaries [1].\n\nhttp://research.microsoft.com/~szli\n* The work presented in this paper was carried out at Microsoft Research Asia.\n\nRelationships of AdaBoost [3, 15] to functional optimization and statistical estimation have been established recently. A number of gradient boosting algorithms have been proposed [4, 8, 21]. A significant advance was made by Friedman et al. [5], who show that the AdaBoost algorithms minimize an exponential loss function which is closely related to the Bernoulli likelihood.\n\nIn this paper, we address the following problems associated with AdaBoost:\n\n1. AdaBoost minimizes an exponential (or some other form of) function of the margin over the training set. This is for convenience of theoretical and numerical analysis. However, the ultimate goal in applications is always minimum error rate. A strong classifier learned by AdaBoost may not necessarily be the best by this criterion. This problem has been noted, e.g. by [2], but no solutions have been found in the literature.\n\n2. 
An effective and tractable algorithm for learning weak classifiers is needed. Learning the optimal weak classifier, such as the log posterior ratio given in [15, 5], requires estimation of densities in the input data space. When the dimensionality is high, this is a difficult problem by itself.\n\nWe propose a method, called FloatBoost (Section 3), to overcome the first problem. FloatBoost incorporates into AdaBoost the idea of Floating Search, originally proposed in [11] for feature selection. A backtrack mechanism therein allows deletion of those weak classifiers that are ineffective or unfavorable in terms of the error rate. This leads to a strong classifier consisting of fewer weak classifiers. Because deletions in backtracking are performed according to the error rate, an improvement in classification error is also obtained. To solve the second problem above, we provide a statistical model (Section 4) for learning weak classifiers and effective feature selection in high dimensional feature space. A base set of weak classifiers, defined as the log posterior ratio, is derived based on an overcomplete set of scalar features. Experimental results are presented in Section 5 using a difficult classification problem, face detection. Comparisons are made between FloatBoost and AdaBoost in terms of the error rate and the complexity of the boosted classifier. 
Results clearly show that FloatBoost yields a strong classifier consisting of fewer weak classifiers yet achieves lower error rates.\n\n2 AdaBoost Learning\n\nIn this section, we give a brief description of the AdaBoost algorithm, in the form of RealBoost [15, 5], as opposed to the original discrete AdaBoost [3].\n\nFor two-class problems, a set of $N$ labelled training examples is given as $(x_1, y_1), \\ldots, (x_N, y_N)$, where $y_i \\in \\{+1, -1\\}$ is the class label associated with example $x_i \\in \\mathbb{R}^n$. A stronger classifier is a linear combination of $M$ weak classifiers\n\n$$H_M(x) = \\sum_{m=1}^{M} h_m(x) \\qquad (1)$$\n\nIn this real version of AdaBoost, the weak classifiers can take a real value, $h_m(x) \\in \\mathbb{R}$, and have absorbed the coefficients needed in the discrete version (there, $h_m(x) \\in \\{-1, +1\\}$). The class label for $x$ is obtained as $H(x) = \\mathrm{sign}[H_M(x)]$, while the magnitude $|H_M(x)|$ indicates the confidence. Every training example is associated with a weight. During the learning process, the weights are updated dynamically in such a way that more emphasis is placed on hard examples which were erroneously classified previously. This re-weighting is important for the original AdaBoost. However, recent studies [4, 8, 21] show that the artificial operation of explicit re-weighting is unnecessary and can be incorporated into a functional optimization procedure of boosting.\n\n0. (Input)\n  (1) Training examples $Z = \\{(x_1, y_1), \\ldots, (x_N, y_N)\\}$, where $N = a + b$; of which $a$ examples have $y_i = +1$ and $b$ examples have $y_i = -1$;\n  (2) The maximum number $M_{\\max}$ of weak classifiers to be combined;\n1. (Initialization)\n  $w_i^{(0)} = \\frac{1}{2a}$ for those examples with $y_i = +1$ or $w_i^{(0)} = \\frac{1}{2b}$ for those examples with $y_i = -1$;\n  $M = 0$;\n2. (Forward Inclusion)\n  while $M < M_{\\max}$\n  (1) $M \\leftarrow M + 1$;\n  (2) Choose $h_M$ according to Eq.4;\n  (3) Update $w_i^{(M)} \\leftarrow \\exp[-y_i H_M(x_i)]$, and normalize to $\\sum_i w_i^{(M)} = 1$;\n3. (Output)\n  $H(x) = \\mathrm{sign}[\\sum_{m=1}^{M} h_m(x)]$.\n\nFigure 1: RealBoost Algorithm.\n\nAn error occurs when $H(x) \\ne y$, or $y H_M(x) < 0$. The \u201cmargin\u201d of an example $(x, y)$ achieved by $h(x) \\in \\mathbb{R}$ on the training set examples is defined as $y h(x)$. This can be considered as a measure of the confidence of $h$\u2019s prediction. The upper bound on classification error achieved by $H_M$ can be derived as the following exponential loss function [14]\n\n$$J(H_M) = \\sum_i e^{-y_i H_M(x_i)} = \\sum_i e^{-y_i \\sum_{m=1}^{M} h_m(x_i)} \\qquad (2)$$\n\nAdaBoost constructs $h_m(x)$ by stage-wise minimization of Eq.(2). Given the current $H_{M-1}(x) = \\sum_{m=1}^{M-1} h_m(x)$, the best $h_M(x)$ for the new strong classifier $H_M(x) = H_{M-1}(x) + h_M(x)$ is the one which leads to the minimum cost\n\n$$h_M = \\arg\\min_{h^{\\dagger}} J(H_{M-1}(x) + h^{\\dagger}(x)) \\qquad (3)$$\n\nIt is shown in [15, 5] that the minimizer is\n\n$$h_M(x) = \\frac{1}{2} \\log \\frac{P(y=+1 \\mid x, w^{(M-1)})}{P(y=-1 \\mid x, w^{(M-1)})} \\qquad (4)$$\n\nwhere $w^{(M-1)}$ are the weights given at time $M$. Using $P(y \\mid x, w) \\propto p(x \\mid y, w) P(y)$ and letting\n\n$$L_M(x) = \\frac{1}{2} \\log \\frac{p(x \\mid y=+1, w^{(M-1)})}{p(x \\mid y=-1, w^{(M-1)})} \\qquad (5)$$\n\n$$T = \\frac{1}{2} \\log \\frac{P(y=+1)}{P(y=-1)} \\qquad (6)$$\n\nwe arrive at\n\n$$h_M(x) = L_M(x) + T \\qquad (7)$$\n\nThe half log likelihood ratio $L_M(x)$ is learned from the training examples of the two classes, and the threshold $T$ is determined by the log ratio of prior probabilities. $T$ can be adjusted to balance between detection rate and false alarm (ROC curve). The algorithm is shown in Fig.1 (Note: the re-weighting formula in this description is equivalent to the multiplicative rule in the original form of AdaBoost [3, 15]). In Section 4, we will present a model for approximating $p(x \\mid y, w^{(M-1)})$.\n\n3 FloatBoost Learning\n\nFloatBoost backtracks after the newest weak classifier $h_M$ is added and deletes unfavorable weak classifiers $h_m$ from the ensemble (1), following the idea of Floating Search [11]. Floating Search [11] was originally aimed at dealing with the non-monotonicity of straight sequential feature selection, non-monotonicity meaning that adding an additional feature may lead to a drop in performance. When a new feature is added, backtracks are performed to delete those features that cause performance drops. The limitations of sequential feature selection are thus amended, with the improvement gained at the cost of increased computation due to the extended search.\n\n0. (Input)\n  (1) Training examples $Z = \\{(x_1, y_1), \\ldots, (x_N, y_N)\\}$, where $N = a + b$; of which $a$ examples have $y_i = +1$ and $b$ examples have $y_i = -1$;\n  (2) The maximum number $M_{\\max}$ of weak classifiers;\n  (3) The error rate $\\epsilon(H_M)$, and the acceptance threshold $\\epsilon^*$.\n1. (Initialization)\n  (1) $w_i^{(0)} = \\frac{1}{2a}$ for those examples with $y_i = +1$ or $w_i^{(0)} = \\frac{1}{2b}$ for those examples with $y_i = -1$;\n  (2) $\\epsilon_m^{\\min}$ = max-value (for $m = 1, \\ldots, M_{\\max}$), $M = 0$, $\\mathcal{H}_0 = \\{\\}$.\n2. (Forward Inclusion)\n  (1) $M \\leftarrow M + 1$;\n  (2) Choose $h_M$ according to Eq.4;\n  (3) Update $w_i^{(M)} \\leftarrow \\exp[-y_i H_M(x_i)]$, and normalize to $\\sum_i w_i^{(M)} = 1$;\n  (4) $\\mathcal{H}_M = \\mathcal{H}_{M-1} \\cup \\{h_M\\}$; If $\\epsilon_M^{\\min} > \\epsilon(H_M)$, then $\\epsilon_M^{\\min} = \\epsilon(H_M)$;\n3. (Conditional Exclusion)\n  (1) $h' = \\arg\\min_{h \\in \\mathcal{H}_M} \\epsilon(H_M - h)$;\n  (2) If $\\epsilon(H_M - h') < \\epsilon_{M-1}^{\\min}$, then\n    (a) $\\mathcal{H}_{M-1} = \\mathcal{H}_M - h'$; $\\epsilon_{M-1}^{\\min} = \\epsilon(H_M - h')$; $M = M - 1$;\n    (b) $H_M = \\sum_{h \\in \\mathcal{H}_M} h$;\n    (c) goto 3.(1);\n  (3) else\n    (a) if $M = M_{\\max}$ or $\\epsilon(H_M) < \\epsilon^*$, then goto 4;\n    (b) $w_i^{(M)} \\leftarrow \\exp[-y_i H_M(x_i)]$; goto 2.(1);\n4. (Output)\n  $H(x) = \\mathrm{sign}[\\sum_{m=1}^{M} h_m(x)]$.\n\nFigure 2: FloatBoost Algorithm.\n\nThe FloatBoost procedure is shown in Fig.2. Let $\\mathcal{H}_M = \\{h_1, \\ldots, h_M\\}$ be the so-far-best set of $M$ weak classifiers; $\\epsilon(H_M)$ be the error rate achieved by $H_M(x) = \\sum_{m=1}^{M} h_m(x)$ (or a weighted sum of missing rate and false alarm rate, which is usually the criterion in one-class detection problems); $\\epsilon_m^{\\min}$ be the minimum error rate achieved so far with an ensemble of $m$ weak classifiers.\n\nIn Step 2 (forward inclusion), given those already selected, the best weak classifier is added, one at a time, which is the same as in AdaBoost. In Step 3 (conditional exclusion), FloatBoost removes the least significant weak classifier from $\\mathcal{H}_M$, subject to the condition that the removal leads to an error rate lower than $\\epsilon_{M-1}^{\\min}$. These are repeated until no more removals can be done. The procedure terminates when the risk on the training set is below the acceptance threshold or the maximum number $M_{\\max}$ is reached.\n\nIncorporating the conditional exclusion, FloatBoost renders both effective feature selection and classifier learning. It usually needs fewer weak classifiers than AdaBoost to achieve the same error rate $\\epsilon$.\n\n4 Learning Weak Classifiers\n\nThis section presents a method for computing the log likelihood in Eq.(5) required in learning optimal weak classifiers. Since deriving a weak classifier in high dimensional space is a non-trivial task, here we provide a statistical model for stagewise learning of weak classifiers based on some scalar features. A scalar feature $z_k$ of $x$ is computed by a transform from the $n$-dimensional data space to the real line, $z_k(x) \\in \\mathbb{R}$. A feature can be the coefficient of, say, a wavelet transform in signal and image processing. If projection pursuit is used as the transform, $z_k(x)$ is simply the $k$-th coordinate of $x$. A dictionary of $K$ candidate scalar features can be created, $\\mathcal{Z} = \\{z_1(x), \\ldots, z_K(x)\\}$. In the following, we use $z^{(m)}$ to denote the feature selected in the $m$-th stage, while $z_k(x)$ is the feature computed from $x$ using the $k$-th transform.\n\nAssuming that $\\mathcal{Z}$ is an over-complete basis, a set of candidate weak classifiers for the optimal weak classifier (7) can be designed in the following way: First, at stage $M$, where $M-1$ features $z^{(1)}, \\ldots, z^{(M-1)}$ have been selected and the weight is given as $w^{(M-1)}$, we can approximate $p(x \\mid y, w^{(M-1)})$ by using the distributions of $M$ features\n\n$$p(x \\mid y, w^{(M-1)}) \\approx p(z^{(1)}, z^{(2)}, \\ldots, z^{(M-1)}, z_k \\mid y, w^{(M-1)}) \\qquad (8)$$\n\n$$= p(z^{(1)} \\mid y, w^{(M-1)}) \\, p(z^{(2)} \\mid y, z^{(1)}, w^{(M-1)}) \\cdots p(z^{(M-1)} \\mid y, z^{(1)}, \\ldots, z^{(M-2)}, w^{(M-1)}) \\, p(z_k \\mid y, z^{(1)}, \\ldots, z^{(M-1)}, w^{(M-1)}) \\qquad (9)$$\n\nBecause $\\mathcal{Z}$ is an over-complete basis set, the approximation is good enough for large enough $M$ and when the $M$ features are chosen appropriately.\n\nNote that $p(z^{(m)} \\mid y, z^{(1)}, \\ldots, z^{(m-1)})$ is actually $p(z^{(m)} \\mid y, w^{(m-1)})$ because $w^{(m)}$ contains the information about the entire history of $w$ and accounts for the dependencies on $z^{(1)}, \\ldots, z^{(m-1)}$. Therefore, we have\n\n$$p(x \\mid y, w^{(M-1)}) \\approx p(z^{(1)} \\mid y, w^{(0)}) \\, p(z^{(2)} \\mid y, w^{(1)}) \\cdots \\qquad (10)$$\n\n$$p(z^{(M-1)} \\mid y, w^{(M-2)}) \\, p(z_k \\mid y, w^{(M-1)}) \\qquad (11)$$\n\nOn the right-hand side of the above equation, all conditional densities are fixed except the last one, $p(z_k \\mid y, w^{(M-1)})$. Learning the best weak classifier at stage $M$ is to choose the best feature $z_{k^*}$ for $z^{(M)}$ such that $J$ is minimized according to Eq.(3).\n\nThe conditional probability densities $p(z_k \\mid y, w^{(M-1)})$ for the positive class $y = +1$ and the negative class $y = -1$ can be estimated using the histograms computed from the weighted voting of the training examples using the weights $w^{(M-1)}$. Let\n\n$$L_k^{(M)}(x) = \\frac{1}{2} \\log \\frac{p(z_k \\mid y=+1, w^{(M-1)})}{p(z_k \\mid y=-1, w^{(M-1)})} \\qquad (12)$$\n\nand $h_k^{(M)}(x) = L_k^{(M)}(x) + T$. We can derive the set of candidate weak classifiers as\n\n$$\\mathcal{H}^{(M)} = \\{ h_k^{(M)}(x) \\mid \\forall k \\} \\qquad (13)$$\n\nRecall that the best $h_M(x)$ among all in $\\mathcal{H}^{(M)}$ for the new strong classifier $H_M(x) = H_{M-1}(x) + h_M(x)$ is given by Eq.(3) among all $h^{\\dagger} \\in \\mathcal{H}^{(M)}$, for which the optimal weak classifier has been derived as (7). According to the theory of gradient based boosting [4, 8, 21], we can choose the optimal weak classifier by finding the $h^{\\dagger}$ that best fits the gradient $-\\nabla J(H_{M-1})$, where\n\n$$\\nabla J(H_{M-1}) = \\left( \\frac{\\partial J}{\\partial H_{M-1}(x_1)}, \\ldots, \\frac{\\partial J}{\\partial H_{M-1}(x_N)} \\right) \\qquad (14)$$\n\nIn our stagewise approximation formulation, this can be done by first finding the $h_k^{(M)}$ that best fits $-\\nabla J(H_{M-1})$ in direction and then scaling it so that the two have the same (re-weighted) norm. An alternative selection scheme is simply to choose $k$ so that the error rate (or some risk), computed from the two histograms $p(z_k \\mid y=+1, w^{(M-1)})$ and $p(z_k \\mid y=-1, w^{(M-1)})$, is minimized.\n\n5 Experimental Results\n\nFace Detection\n\nThe face detection problem here is to classify an image of standard size (e.g. 20x20 pixels) into either face or nonface (imposter). This is essentially a one-class problem in that everything not a face is a nonface. It is a very hard problem. Learning based methods have been the main approach for solving the problem, e.g. [13, 16, 9, 12]. Experiments here follow the framework of Viola and Jones [19, 18]. 
There, AdaBoost is used for learning face detection; it performs two important tasks: feature selection from a large collection of features, and construction of classifiers using the selected features.\n\nData Sets\n\nA set of 5000 face images is collected from various sources. The faces are cropped and re-scaled to the size of 20x20. Another set of 5000 nonface examples of the same size is collected from images containing no faces. The 5000 examples in each set are divided into a training set of 4000 examples and a test set of 1000 examples. See Fig.3 for a random sample of 10 face and 10 nonface examples.\n\nFigure 3: Face (top) and nonface (bottom) examples.\n\nScalar Features\n\nThree basic types of scalar features $z_k$ are derived from each example, as shown in Fig.4, for constructing weak classifiers. These block differences are an extended set of steerable filters used in [10, 20]. There are hundreds of thousands of different $z_k$ for admissible $x, y, dx, dy$ values. Each candidate weak classifier is constructed as the log likelihood ratio (12) computed from the two histograms $p(z_k \\mid y, w^{(M-1)})$ of a scalar feature $z_k$ for the face ($y = +1$) and nonface ($y = -1$) examples (cf. the last part of the previous section).\n\nFigure 4: The three types of simple Haar wavelet like features defined on a sub-window $x$. The rectangles are of size $x \\times y$ and are at distances of $(dx, dy)$ apart. Each feature takes a value calculated by the weighted ($\\pm 1, 2$) sum of the pixels in the rectangles.\n\nPerformance Comparison\n\nThe same data sets are used for evaluating FloatBoost and AdaBoost. The performance is measured by the false alarm error rate given the detection rate fixed at 99.5%. While a cascade of strong classifiers is needed to achieve very low false alarm [19, 7], here we present the learning curves for the first strong classifier composed of up to one thousand weak classifiers. This is because what we aim to evaluate here is the contrast between the FloatBoost and AdaBoost learning algorithms, rather than the system work. The interested reader is referred to [7] for a complete system which achieved a false alarm of \u001d]d\n\u0005 with the detection rate of 95%. (A live demo of a multi-view face detection system, the first real-time system of the kind in the world, is being submitted to the conference.)\n\nFigure 5: Error Rates of FloatBoost vs AdaBoost for frontal face detection. (Plot of error rates against the number of weak classifiers, from 100 to 1000, with four curves: AdaBoost-train, AdaBoost-test, FloatBoost-train and FloatBoost-test.)\n\nThe training and testing error curves for FloatBoost and AdaBoost are shown in Fig.5, with the detection rate fixed at 99.5%. The following conclusions can be made from these curves: (1) Given the same number of learned features or weak classifiers, FloatBoost always achieves lower training error and lower test error than AdaBoost. For example, on the test set, by combining 1000 weak classifiers, the false alarm of FloatBoost is 0.427 versus 0.485 of AdaBoost. 
(2) FloatBoost needs many fewer weak classifiers than AdaBoost in order to achieve the same false alarm rate. For example, the lowest test error for AdaBoost is 0.481 with 800 weak classifiers, whereas FloatBoost needs only 230 weak classifiers to achieve the same performance. This clearly demonstrates the strength of FloatBoost in learning to achieve a lower error rate.\n\n6 Conclusion and Future Work\n\nBy incorporating the idea of Floating Search [11] into AdaBoost [3, 15], FloatBoost effectively improves the learning results. It needs fewer weak classifiers than AdaBoost to achieve a similar error rate, or achieves a lower error rate with the same number of weak classifiers. Such a performance improvement is achieved at the cost of longer training time, about 5 times longer for the experiments reported in this paper.\n\nThe boosting algorithm may need substantial computation for training. Several methods can be used to make the training more efficient with little drop in the training performance. Noticing that only examples with large weight values are influential, Friedman et al. [5] propose to select examples with large weights, i.e. those which in the past have been wrongly classified by the learned weak classifiers, for training the weak classifier in the next round. Top examples within a fraction of $1-\\beta$ of the total weight mass are used, where $\\beta \\in [0.01, 0.1]$.\n\nReferences\n[1] L. Breiman. \u201cArcing classifiers\u201d. The Annals of Statistics, 26(3):801\u2013849, 1998.\n[2] P. Buhlmann and B. Yu. \u201cInvited discussion on \u2018Additive logistic regression: a statistical view of boosting (Friedman, Hastie and Tibshirani)\u2019\u201d. The Annals of Statistics, 28(2):377\u2013386, April 2000.\n[3] Y. Freund and R. Schapire. 
\u201cA decision-theoretic generalization of on-line learning and an application to boosting\u201d. Journal of Computer and System Sciences, 55(1):119\u2013139, Aug 1997.\n[4] J. Friedman. \u201cGreedy function approximation: A gradient boosting machine\u201d. The Annals of Statistics, 29(5), October 2001.\n[5] J. Friedman, T. Hastie, and R. Tibshirani. \u201cAdditive logistic regression: a statistical view of boosting\u201d. The Annals of Statistics, 28(2):337\u2013374, April 2000.\n[6] M. J. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.\n[7] S. Z. Li, L. Zhu, Z. Q. Zhang, A. Blake, H. Zhang, and H. Shum. \u201cStatistical learning of multi-view face detection\u201d. In Proceedings of the European Conference on Computer Vision, page ???, Copenhagen, Denmark, May 28 - June 2 2002.\n[8] L. Mason, J. Baxter, P. Bartlett, and M. Frean. \u201cFunctional gradient techniques for combining hypotheses\u201d. In A. Smola, P. Bartlett, B. Sch\u00f6lkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221\u2013247. MIT Press, Cambridge, MA, 1999.\n[9] E. Osuna, R. Freund, and F. Girosi. \u201cTraining support vector machines: An application to face detection\u201d. In CVPR, pages 130\u2013136, 1997.\n[10] C. P. Papageorgiou, M. Oren, and T. Poggio. \u201cA general framework for object detection\u201d. In Proceedings of IEEE International Conference on Computer Vision, pages 555\u2013562, Bombay, India, 1998.\n[11] P. Pudil, J. Novovicova, and J. Kittler. \u201cFloating search methods in feature selection\u201d. Pattern Recognition Letters, (11):1119\u20131125, 1994.\n[12] D. Roth, M. Yang, and N. Ahuja. \u201cA snow-based face detector\u201d. In Proceedings of Neural Information Processing Systems, 2000.\n[13] H. A. Rowley, S. Baluja, and T. Kanade. \u201cNeural network-based face detection\u201d. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23\u201328, 1998.\n[14] R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. \u201cBoosting the margin: A new explanation for the effectiveness of voting methods\u201d. The Annals of Statistics, 26(5):1651\u20131686, October 1998.\n[15] R. E. Schapire and Y. Singer. \u201cImproved boosting algorithms using confidence-rated predictions\u201d. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 80\u201391, 1998.\n[16] K.-K. Sung and T. Poggio. \u201cExample-based learning for view-based human face detection\u201d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39\u201351, 1998.\n[17] L. Valiant. \u201cA theory of the learnable\u201d. Communications of ACM, 27(11):1134\u20131142, 1984.\n[18] P. Viola and M. Jones. \u201cAsymmetric AdaBoost and a detector cascade\u201d. In Proceedings of Neural Information Processing Systems, Vancouver, Canada, December 2001.\n[19] P. Viola and M. Jones. \u201cRapid object detection using a boosted cascade of simple features\u201d. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, December 12-14 2001.\n[20] P. Viola and M. Jones. \u201cRobust real time object detection\u201d. In IEEE ICCV Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, July 13 2001.\n[21] R. Zemel and T. Pitassi. \u201cA gradient-based boosting algorithm for regression problems\u201d. In Advances in Neural Information Processing Systems, volume 13, Cambridge, MA, 2001. 
MIT Press.\n", "award": [], "sourceid": 2266, "authors": [{"given_name": "Stan", "family_name": "Li", "institution": null}, {"given_name": "Zhenqiu", "family_name": "Zhang", "institution": null}, {"given_name": "Heung-yeung", "family_name": "Shum", "institution": null}, {"given_name": "Hongjiang", "family_name": "Zhang", "institution": null}]}