{"title": "Pass-efficient unsupervised feature selection", "book": "Advances in Neural Information Processing Systems", "page_first": 1628, "page_last": 1636, "abstract": "The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of {\\em several orders of magnitude} over the classical algorithm. These results appear to be competitive with  recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce better features, at the cost of small probability of failure.", "full_text": "Pass-Ef\ufb01cient Unsupervised Feature Selection\n\nCrystal Maung\n\nDepartment of Computer Science\nThe University of Texas at Dallas\nCrystal.Maung@gmail.com\n\nHaim Schweitzer\n\nDepartment of Computer Science\nThe University of Texas at Dallas\nHSchweitzer@utdallas.edu\n\nAbstract\n\nThe goal of unsupervised feature selection is to identify a small number of impor-\ntant features that can represent the data. We propose a new algorithm, a modi\ufb01ca-\ntion of the classical pivoted QR algorithm of Businger and Golub, that requires a\nsmall number of passes over the data. The improvements are based on two ideas:\nkeeping track of multiple features in each pass, and skipping calculations that can\nbe shown not to affect the \ufb01nal selection. 
Our algorithm selects the exact same\nfeatures as the classical pivoted QR algorithm, and has the same favorable numerical\nstability. We describe experiments on real-world datasets which sometimes\nshow improvements of several orders of magnitude over the classical algorithm.\nThese results appear to be competitive with recently proposed randomized algorithms\nin terms of pass efficiency and run time. On the other hand, the randomized\nalgorithms may produce more accurate features, at the cost of a small probability of\nfailure.\n\n1 Introduction\n\nWork on unsupervised feature selection has received considerable attention. See, e.g., [1, 2, 3, 4,\n5, 6, 7, 8]. In numerical linear algebra unsupervised feature selection is known as the column\nsubset selection problem, where one attempts to identify a small subset of matrix columns that can\napproximate the entire column space of the matrix. See, e.g., [9, Chapter 12]. The distinction\nbetween supervised and unsupervised feature selection is as follows. In the supervised case one\nis given labeled objects as training data and features are selected to help predict that label; in the\nunsupervised case nothing is known about the labels.\n\nWe describe an improvement to the classical Businger and Golub pivoted QR algorithm [9, 10]. We\nrefer to the original algorithm as the QRP, and to our improved algorithm as the IQRP. The QRP\nselects features one by one, using k passes in order to select k features. In each pass the selected\nfeature is the one that is the hardest to approximate by the previously selected features. We achieve\nimprovements to the algorithm run time and pass efficiency without affecting the selection and the\nexcellent numerical stability of the original algorithm. Our algorithm is deterministic, and runs in a\nsmall number of passes over the data. It is based on the following two ideas:\n\n1. 
In each pass we identify multiple features that are hard to approximate with the previously\nselected features. A second selection step among these features uses an upper bound on\nunselected features that enables identifying multiple features that are guaranteed to have\nbeen selected by the QRP. See Section 4 for details.\n\n2. Since the error of approximating a feature can only decrease when additional features are\nadded to the selection, there is no need to evaluate candidates with error that is already \u201ctoo\nsmall\u201d. This allows a significant reduction in the number of candidate features that need to\nbe considered in each pass. See Section 4 for details.\n\n2 Algorithms for unsupervised feature selection\n\nThe algorithms that we consider take as input large matrices of numeric values. We denote by m\nthe number of rows, by n the number of columns (features), and by k the number of features to be\nselected. Criteria for evaluating algorithms include their run time and memory requirements, the\nnumber of passes over the data, and the algorithm accuracy. The accuracy is a measure of the error\nof approximating the entire data matrix as a linear combination of the selection. We review some\nclassical and recent algorithms for unsupervised feature selection.\n\n2.1 Related work in numerical linear algebra\n\nThe Businger and Golub QRP was established by Businger and Golub [9, 10]. We discuss it in detail\nin Section 3. It requires k passes for selecting k features, and its run time is\n$4kmn - 2k^2(m + n) + 4k^3/3$. A recent study [11] compares experimentally the accuracy of the QRP as a feature selection\nalgorithm to some recently proposed state-of-the-art algorithms. Even though the accuracy of the\nQRP is somewhat below the other algorithms, the results are quite similar. 
(The only exception was\nthe performance on the Kahan matrix, where the QRP was much less accurate.)\n\nThe Gu and Eisenstat algorithm [1] was considered the most accurate prior to the work on randomized\nalgorithms that had started with [12]. It computes an initial selection (typically by using the QRP),\nand then repeatedly swaps selected columns with unselected columns. The swapping is done so that\nthe product of singular values of the matrix formed by the selected columns is increased with each\nswap. The algorithm requires random access memory, and it is not clear how to implement it\nby a series of passes over the data. Its run time is $O(m^2 n)$.\n\n2.2 Randomized algorithms\n\nRandomized algorithms come with a small probability of failure, but otherwise appear to be more\naccurate than the classical deterministic algorithms. Frieze et al. [12, 13] have proposed a randomized\nalgorithm that requires only two passes over the data. This assumes that the norms of all matrix\ncolumns are known in advance, and guarantees only an additive approximation error. We discuss\nthe run time and the accuracy of several generalizations that followed their studies.\n\nVolume sampling. Deshpande et al. [14] have studied a randomized algorithm that samples k-tuples\nof columns with probability proportional to their \u201cvolume\u201d. The volume is the square of the product\nof the singular values of the submatrix formed by these columns. They show that this sampling\nscheme gives rise to a randomized algorithm that computes the best possible solution in the Frobenius\nnorm. They describe an efficient O(kmn) randomized algorithm that can be implemented in k\npasses and approximates this sampling scheme. These results were improved (in terms of accuracy)\nin [15], by computing the exact volume sampling. The resulting algorithm is slower but much more\naccurate. Further improvements to the speed of volume sampling in [6] have reduced the run time\ncomplexity to $O(km^2 n)$. 
As shown in [15, 6], this optimal (in terms of accuracy) algorithm can also\nbe derandomized, with a deterministic run time of $O(km^3 n)$.\n\nLeverage sampling. The idea behind leverage sampling is to randomly select features with probability\nproportional to their \u201cleverage\u201d. Leverage values are norms of the rows of the $n \\times k$ right\neigenvector matrix in the truncated SVD expansion of the data matrix. See [16, 2]. In particular,\nthe \u201ctwo stage\u201d algorithm described in [2] requires only 2 passes if the leverage values are known.\nIts run time is dominated by the calculation of the leverage values. To the best of our knowledge\nthe currently best algorithms for estimating leverage values are randomized [17, 18]. One run takes\n2 passes and $O(mn \\log n + m^3)$ time. This is dominated by the mn term, and [18] show that it\ncan be further reduced to the number of nonzero values. We note that these algorithms do not compute\nreliable leverage in 2 passes, since they may fail at a relatively high (e.g., 1/3) probability. As\nstated in [18] \u201cthe success probability can be amplified by independent repetition and taking the\ncoordinate-wise median\u201d. Therefore, accurate estimates of leverage can be computed in a constant\nnumber of passes. But the constant would be larger than 2.\n\nInput: The features (matrix columns) $x_1, \\ldots, x_n$, and an integer $k \\le n$.\nOutput: An ordered list S of k indices.\n1. In the initial pass compute:\n   1.1. For $i = 1, \\ldots, n$ set $\\tilde{x}_i = x_i$, $v_i = |\\tilde{x}_i|^2$.\n        ($\\tilde{x}_i$ is the error vector of approximating $x_i$ by a linear combination of the columns in S.)\n   At the end of the pass set $z_1 = \\arg\\max_i v_i$, and initialize $S = (z_1)$.\n2. For each pass $j = 2, \\ldots, k$:\n   2.1. For $i = 1, \\ldots, n$ set $v_i$ to the square error of approximating $x_i$ by a linear\n        combination of the columns in S.\n   At the end of pass j set $z_j = \\arg\\max_i v_i$, and add $z_j$ to S.\n\nFigure 1: The main steps of the QRP algorithm.\n\n2.3 Randomized ID\n\nIn a recent survey [19] Halko et al. describe how to compute QR factorization using their randomized\nInterpolative Decomposition. Their approach produces an accurate Q as a basis of the data\nmatrix column space. They propose an efficient \u201crow extraction\u201d method for computing R, that\nworks when k, the desired rank, is similar to the rank of the data matrix. Otherwise the row extraction\nintroduces unacceptable inaccuracies, which led Halko et al. to recommend using an alternative\nO(kmn) technique in such cases.\n\n2.4 Our result, the complexity of the IQRP\n\nThe savings that the IQRP achieves depend on the data. The algorithm takes as input an integer value\nl, the length of a temporary buffer. As explained in Section 4 our implementation requires temporary\nstorage of l + 1 columns, which takes (l + 1)m floats. The following values depend on the data: the number\nof passes p, the number of IO-passes q (explained below), and a unit cost of orthogonalization c (see\nSection 4.3).\n\nIn terms of l and c the run time is 2mn + 4mnc + 4mlk. Our experiments show that for typical\ndatasets the value of c is below k. For l \u2248 k our experiments show that the number of passes is\ntypically much smaller than k. The number of passes is even smaller if one considers IO-passes. To\nexplain what we mean by IO-passes consider as an example a situation where the algorithm runs\nthree passes over the data. In the first pass all n features are being accessed. In the second, only two\nfeatures are being accessed. In the third, only one feature is being accessed. 
In this case we take\nthe number of IO-passes to be $q = 1 + 3/n$. We believe that q is a relevant measure of the algorithm pass\ncomplexity when skipping is cheap, so that the cost of a pass over the data is the amount of data that\nneeds to be read.\n\n3 The Businger Golub algorithm (QRP)\n\nIn this section we describe the QRP [9, 10] which forms the basis to the IQRP. The main steps\nare described in Figure 1. There are two standard implementations for Step 2.1 in Figure 1. The\nfirst is by means of the \u201cModified Gram-Schmidt\u201d (e.g., [9]), and the second is by Householder\northogonalization (e.g., [9]). Both methods require approximately the same number of flops, but\nerror analysis (see [9]) shows that the Householder approach is significantly more stable.\n\n3.1 Memory-efficient implementations\n\nThe implementations shown in Figure 2 update the memory where the matrix A is stored. Specifically,\nA is overwritten by the R component of the QR factorization. Since we are not interested in\nR, overwriting A may not be acceptable. The procedure shown in Figure 3 does not overwrite A,\nbut it is more costly. The flops count is dominated by Steps 1 and 2, which cost at most $4(j - 1)mn$\nat pass j. Summing up for $j = 1, \\ldots, k$ this gives a total flops count of approximately $2k^2 mn$ flops.\n\nModified Gram-Schmidt:\nCompute $z_j$, $q_j$, $Q_j$\nfor $i = 1, \\ldots, n$\n   1. $w_i = q_{j-1}^T \\tilde{x}_i$.\n   2. $\\tilde{x}_i \\leftarrow \\tilde{x}_i - w_i q_{j-1}$.\n   3. $v_i \\leftarrow v_i - w_i^2$.\nAt the end of the pass:\n   4. $z_j = \\arg\\max_i v_i$.\n   5. $q_j = x_{z_j} / |x_{z_j}|$.\n   6. $Q_j = (Q_{j-1}, q_j)$.\n\nHouseholder orthogonalization:\nCompute $z_j$, $h_j$, $H_j$\nfor $i = 1, \\ldots, n$\n   1. $\\tilde{x}_i \\leftarrow h_{j-1} \\tilde{x}_i$.\n   2. $w_i = \\tilde{x}_i(j)$ (the j\u2019th coordinate of $\\tilde{x}_i$).\n   3. $v_i \\leftarrow v_i - w_i^2$.\nAt the end of the pass:\n   4. $z_j = \\arg\\max_i v_i$.\n   5. Create the Householder matrix $h_j$ from $\\tilde{x}_j$.\n   6. $H_j = H_{j-1} h_j$.\n\nFigure 2: Standard implementations of Step 2.1 of the QRP\n\nModified Gram-Schmidt:\nCompute $z_j$, $q_j$, $Q_j$\nfor $i = 1, \\ldots, n$\n   1. $w_i = Q_{j-1}^T x_i$.\n   2. $v_i = |x_i|^2 - |w_i|^2$.\nAt the end of the pass:\n   3. $z_j = \\arg\\max_i v_i$.\n   4. $\\tilde{q}_j = x_{z_j} - Q_{j-1} w_{z_j}$, $q_j = \\tilde{q}_j / |\\tilde{q}_j|$.\n   5. $Q_j = (Q_{j-1}, q_j)$.\n\nHouseholder orthogonalization:\nCompute $z_j$, $h_j$, $H_j$\nfor $i = 1, \\ldots, n$\n   1. $y_i = H_{j-1} x_i$.\n   2. $v_i = \\sum_{t=j+1}^{m} y_i^2(t)$.\nAt the end of the pass:\n   3. $z_j = \\arg\\max_i v_i$.\n   4. Create $h_j$ from $y_{z_j}$.\n   5. $H_j = H_{j-1} h_j$.\n\nFigure 3: Memory-efficient implementations of Step 2.1 of the QRP\n\n4 The IQRP algorithm\n\nIn this section we describe our main result: the improved QRP. The algorithm maintains three ordered\nlists of columns: The list F is the input list containing all columns. The list S contains\ncolumns that have already been selected. The list L is of size l, where l is a user defined parameter.\nFor each column $x_i$ in F the algorithm maintains an integer value $r_i$ and a real value $v_i$. These\nvalues can be kept in core or a secondary memory. They are defined as follows:\n\n$$r_i \\le |S|, \\qquad v_i = v_i(r_i) = \\|x_i - Q_{r_i} Q_{r_i}^T x_i\\|^2 \\qquad (1)$$\n\nwhere $Q_{r_i} = (q_1, \\ldots, q_{r_i})$ is an orthonormal basis to the first $r_i$ columns in S. Thus, $v_i(r_i)$ is\nthe (squared) error of approximating $x_i$ with the first $r_i$ columns in S. In each pass the algorithm\nidentifies the l candidate columns $x_i$ corresponding to the l largest values of $v_i(|S|)$. That is, the $v_i$\nvalues are computed as the error of predicting each candidate by all columns currently in S. The\nidentified l columns with the largest $v_i(|S|)$ are stored in the list L. In addition, the value of the\nl+1\u2019th largest $v_i(|S|)$ is kept as the constant $B_F$. 
Thus, after a pass is terminated the following\ncondition holds:\n\n$$v_\\alpha(r_\\alpha) \\le B_F \\quad \\text{for all } x_\\alpha \\in F \\setminus L. \\qquad (2)$$\n\nThe list L and the value $B_F$ can be calculated in one pass using a binary heap data structure, with\nthe cost of at most $n \\log(l + 1)$ comparisons. See [20, Chapter 9]. The main steps of the algorithm\nare described in Figure 4.\n\nInput: The matrix columns (features) $x_1, \\ldots, x_n$, and an integer $k \\le n$.\nOutput: An ordered list S of k indices.\n1. (The initial pass over F.)\n   1.0. Create a min-heap of size l+1.\n   In one pass go over $x_i$, $i = 1, \\ldots, n$:\n   1.1. Set $v_i(0) = |x_i|^2$, $r_i = 0$.\n        Fill the heap with the candidates corresponding to the l+1 largest $v_i(0)$.\n   1.2. At the end of the pass:\n        Set $B_F$ to the value at the top of the heap.\n        Set L to heap content excluding the top element.\n        Add to S as many candidates from L as possible using $B_F$.\n2. Repeat until S has k candidates:\n   2.0. Create a min-heap of size l+1.\n        Let T be defined by (3).\n   In one pass go over $x_i$, $i = 1, \\ldots, n$:\n   2.1. Skip $x_i$ if $v_i(r_i) \\le T$. Otherwise update $v_i$, $r_i$, heap.\n   2.2. At the end of the pass:\n        Set $B_F = T$.\n        Set L to heap content excluding the top element.\n        Add to S as many candidates from L as possible using $B_F$.\n\nFigure 4: The main steps of the IQRP algorithm.\n\nDetails of Steps 2.0, 2.1 of the IQRP. The threshold value T is defined by:\n\n$$T = \\begin{cases} -\\infty & \\text{if the heap is not full,} \\\\\\\\ \\text{top of the heap} & \\text{if the heap is full.} \\end{cases} \\qquad (3)$$\n\nThus, when the heap is full, T is the value of v associated with the l+1\u2019th largest candidate encountered\nso far. The details of Step 2.1 are shown in Figure 5. Step A.2.2.1 can be computed using\neither Gram-Schmidt or Householder, as shown in Figures 2 and 3.\n\nA.1. If $v_i(r_i) \\le T$ skip $x_i$.\nA.2. Otherwise check $r_i$:\n   A.2.1. If $r_i = |S|$ conditionally insert $x_i$ into the heap.\n   A.2.2. If $r_i < |S|$:\n      A.2.2.1. Compute $v_i(|S|)$. Set $r_i = |S|$.\n      A.2.2.2. Conditionally insert $x_i$ into the heap.\n\nFigure 5: Details of Step 2.1\n\nDetails of Steps 1.2 and 2.2 of the IQRP. Here we are given the list L and the value of $B_F$\nsatisfying (2). To move candidates from L to S run the QRP on L as long as the pivot value is above\n$B_F$. (The pivot value is the largest value of $v_i(|S|)$ in L.) The details are shown in Figure 6.\n\nB.1. $z = \\arg\\max_{i \\in L} v_i(|S|)$.\nB.2. If $v_z(|S|) < B_F$, we are done exploiting L.\nB.3. Otherwise:\n   B.3.1. Move z from L to S.\n   B.3.2. Update the remaining candidates in L using either Gram-Schmidt or\n          the Householder procedure.\n          For example, with Householder:\n      B.3.2.1. Create the Householder matrix $h_j$ from $x_z$.\n      B.3.2.2. For all x in L replace x with $h_j x$.\n\nFigure 6: Details of Steps 1.2 and 2.2\n\n4.1 Correctness\n\nIn this section we show that the IQRP computes the same selection as the QRP. The proof\nis by induction on j, the number of columns in S. For j = 0 the QRP selects $x_j$ with\n$v_j = |x_j|^2 = \\max_i |x_i|^2$. The IQRP selects $v'_j$ as the largest among the l largest values in F. Therefore\n$v'_j = \\max_{x_i \\in L} |x_i|^2 = \\max_{x_i \\in F} |x_i|^2 = v_j$. Now assume that for j = |S| the QRP and the\nIQRP select the same columns in S (this is the inductive assumption). Let $v_j(|S|)$ be the value of\nthe j+1\u2019th selection by the QRP, and let $v'_j(|S|)$ be the value of the j+1\u2019th selection by the IQRP. We\nneed to show that $v'_j(|S|) = v_j(|S|)$. The QRP selection of j satisfies: $v_j(|S|) = \\max_{x_i \\in F} v_i(|S|)$.\nObserve that if $x_i \\in L$ then $r_i = |S|$. (Initially L is created from the heap elements that have\n$r_i = |S|$. Once S is increased in Step B.3.1 the columns in L are updated according to B.3.2 so that\nthey all satisfy $r_i = |S|$.) 
The IQRP selection satisfies:\n\n$$v'_j(|S|) = \\max_{x_i \\in L} v_i(|S|) \\quad \\text{and} \\quad v'_j(|S|) \\ge B_F. \\qquad (4)$$\n\nAdditionally for all $x_\\alpha \\in F \\setminus L$:\n\n$$B_F \\ge v_\\alpha(r_\\alpha) \\ge v_\\alpha(|S|). \\qquad (5)$$\n\nThis follows from (2), the observation that $v_\\alpha(r)$ is monotonically decreasing in r, and $r_\\alpha \\le |S|$.\nTherefore, combining (4) and (5) we get\n\n$$v'_j(|S|) = \\max_{x_i \\in F} v_i(|S|) = v_j(|S|).$$\n\nThis completes the proof by induction.\n\n4.2 Termination\n\nTo see that the algorithm terminates it is enough to observe that at least one column is selected in\neach pass. The condition at Step B.2 in Figure 6 cannot hold at the first time in a new L. The value\nof $B_F$ is the l+1\u2019th largest $v_i(|S|)$, while the maximum at B.1 is among the l largest $v_i(|S|)$.\n\n4.3 Complexity\n\nThe formulas in this section describe the complexity of the IQRP in terms of the following:\n\nn   the number of features (matrix columns)\nm   the number of objects (matrix rows)\nk   the number of selected features\nl   user provided parameter, $1 \\le l \\le n$\np   number of passes\nq   number of IO-passes\nc   a unit cost of orthogonalizing F\n\nThe value of c depends on the implementation of Step A.2.2.1 in Figure 5. We write $c_{\\text{memory}}$ for the\nvalue of c in the memory-efficient implementation, and $c_{\\text{flops}}$ for the faster implementation (in terms\nof flops). We use the following notation. At pass j the number of selected columns is $k_j$, and the\nnumber of columns that were not skipped in Step 2.1 of the IQRP (same as Step A.1) is $n_j$.\nThe number of flops in the memory-efficient implementation can be shown to be\n\n$$\\text{flops}_{\\text{memory}} = 2mn + 4mnc + 4mlk, \\quad \\text{where } c = \\sum_{j=2}^{p} \\frac{n_j}{n} \\sum_{j'=1}^{j-1} k_{j'} \\qquad (6)$$\n\nObserve that $c \\le k^2/2$, so that for l < n the worst case behavior is the same as the memory-optimized\nQRP algorithm, which is $O(k^2 mn)$. 
We show in Section 5 that the typical run time is\nmuch faster. In particular, the dependency on k appears to be linear and not quadratic.\nFor the faster implementation that overwrites the input it can be shown that:\n\n$$\\text{flops}_{\\text{time}} = 2mn + 4m \\sum_{i=1}^{n} \\tilde{r}_i, \\quad \\text{where } \\tilde{r}_i \\text{ is the value of } r_i \\text{ at termination.} \\qquad (7)$$\n\nSince $\\tilde{r}_i \\le k - 1$ it follows that $\\text{flops}_{\\text{time}} \\le 4kmn$. Thus, the worst case behavior is the same as the\nflops-efficient QRP algorithm.\n\nMemory in the memory-efficient implementation requires km in-core floats, and additional memory\nfor the heap, that can be reused for the list L. Additional memory to store and manipulate $v_i$, $r_i$\nfor $i = 1, \\ldots, n$ is roughly 2n floats. Observe that these memory locations are being accessed\nconsecutively, and can be efficiently stored and manipulated out-of-core. The data itself, the matrix\nA, is stored out-of-core. When the method of Figure 3 is used in A.2.2.1, these matrix values are\nread-only.\n\nIO-passes. We wish to distinguish between a pass where the entire data is accessed and a pass where\nmost of the data is skipped. This suggests the following definition for the number of IO-passes:\n\n$$q = \\sum_{j=1}^{p} \\frac{n_j}{n} = 1 + \\sum_{j=2}^{p} \\frac{n_j}{n}.$$\n\nNumber of floating point comparisons. Testing for the skipping and manipulating the heap requires\nfloating point comparisons. The number of comparisons is $n(p - 1 + (q - 1) \\log_2(l + 1))$. This\ndoes not affect the asymptotic complexity since the number of flops is much larger.\n\n5 Experimental results\n\nWe describe results on several commonly used datasets. \u201cDay1\u201d, with m = 20,000 and n =\n3,231,957 is part of the \u201cURL reputation\u201d collection at the UCI Repository. \u201cthrombin\u201d, with\nm = 1,909 and n = 139,351 is the data used in KDD Cup 2001. 
\u201cAmazon\u201d, with m = 1,500\nand n = 10,000 is part of the \u201cAmazon Commerce reviews set\u201d and was obtained from the UCI\nRepository. \u201cgisette\u201d, with m = 6,000 and n = 5,000 was used in the NIPS 2003 feature selection challenge.\n\nMeasurements. We vary k, and report the following: $\\text{flops}_{\\text{memory}}$, $\\text{flops}_{\\text{time}}$ are the ratios between\nthe number of flops used by the IQRP and kmn, for the memory-efficient orthogonalization and\nthe time-efficient orthogonalization. # passes is the number of passes needed to select k features.\n# IO-passes is discussed in Sections 2.4 and 4.3. It is the number of times that the entire data is read.\nThus, the ratio between the number of IO-passes and the number of passes is the fraction of the data\nthat was not skipped.\n\nRun time. The number of flops of the QRP is between 2kmn and 4kmn. We describe experiments\nwith the list size l taken as l = k. For Day1 the number of flops is lower than that of the QRP by a factor of more\nthan 100. For the other datasets the results are not as impressive. There are still significant savings\nfor small and moderate values of k (say up to k = 600), but for larger values the savings are smaller.\nMost interesting is the observation that the memory-efficient implementation of Step 2.1 is not much\nslower than the optimization for time. Recall that the memory-optimized QRP is k times slower than\nthe time-optimized QRP. In our experiments they differ by no more than a factor of 4.\n\nNumber of passes. We describe experiments with the list size l taken as l = k, and also with\nl = 100 regardless of the value of k. The QRP takes k passes for selecting k features. For the\nDay1 dataset we observed a reduction by a factor of between 50 and 250 in the number of passes. For\nIO-passes, the reduction goes up to a factor of almost 1000. Similar improvements are observed for\nthe Amazon and the gisette datasets. 
For the thrombin dataset it is slightly worse, typically a reduction by\na factor of about 70. The number of IO-passes is always significantly below the number of passes,\ngiving a reduction by factors up to 1000. For the recommended setting of l = k we observed the\nfollowing. In absolute terms the number of passes was below 10 for most of the data; the number of\nIO-passes was below 2 for most of the data.\n\n6 Concluding remarks\n\nThis paper describes a new algorithm for unsupervised feature selection. Based on the experiments\nwe recommend using the memory-efficient implementation and setting the parameter l = k. As\nexplained earlier the algorithm maintains 2 numbers for each column, and these can also be kept\nin-core. This gives a 2(km + n) memory footprint.\n\nOur experiments show that for typical datasets the number of passes is significantly smaller than\nk. In situations where memory can be skipped the notion of IO-passes may be more accurate than\npasses. IO-passes indicate the amount of data that was actually read and not skipped.\n\n[Figure 7 plots omitted. For each dataset (Day1, m = 20,000, n = 3,231,957; thrombin, m = 1,909, n = 139,351; Amazon, m = 1,500, n = 10,000; gisette, m = 6,000, n = 5,000) three panels are plotted against k: flops/kmn for $\\text{flops}_{\\text{memory}}$ and $\\text{flops}_{\\text{time}}$ with l = k; number of passes (#passes and #IO-passes) with l = k; number of passes (#passes and #IO-passes) with l = 100.]\n\nFigure 7: Results of applying the IQRP to several datasets with varying k, and l = k.\n\nThe performance of the IQRP depends on the data. Therefore, the improvements that we observe\ncan also be viewed as an indication that typical datasets are \u201ceasy\u201d. This appears to suggest that\nworst case analysis should not be considered as the only criterion for evaluating feature selection\nalgorithms. Comparing the IQRP to the current state-of-the-art randomized algorithms that were\nreviewed in Section 2.2 we observe that the IQRP is competitive in terms of the number of passes\nand appears to outperform these algorithms in terms of the number of IO-passes. On the other hand,\nit may be less accurate.\n\nReferences\n\n[1] M. Gu and S. C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput., 17(4):848\u2013869, 1996.\n\n[2] C. Boutsidis, M. W. Mahoney, and P. Drineas. 
An improved approximation algorithm for the column subset selection problem. In Claire Mathieu, editor, Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, New York, NY, USA, January 4-6, 2009, pages 968\u2013977. SIAM, 2009.\n\n[3] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction, February 2011. arXiv e-print (arXiv:1103.0995).\n\n[4] A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney. Feature selection methods for text classification. In Pavel Berkhin, Rich Caruana, and Xindong Wu, editors, KDD, pages 230\u2013239. ACM, 2007.\n\n[5] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Sparse features for PCA-like linear regression. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, NIPS, pages 2285\u20132293, 2011.\n\n[6] V. Guruswami and A. K. Sinop. Optimal column-based low-rank matrix reconstruction. In Yuval Rabani, editor, Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, Kyoto, Japan, January 17-19, 2012, pages 1207\u20131214. SIAM, 2012.\n\n[7] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu. Unsupervised feature selection using nonnegative spectral analysis. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada. AAAI Press, 2012.\n\n[8] S. Zhang, H.S. Wong, Y. Shen, and D. Xie. A new unsupervised feature ranking method for gene expression data based on consensus affinity. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(4):1257\u20131263, July 2012.\n\n[9] G. H. Golub and C. F. Van-Loan. Matrix computations. The Johns Hopkins University Press, third edition, 1996.\n\n[10] P. Businger and G. H. Golub. Linear least squares solutions by Householder transformations. Numer. Math., 7:269\u2013276, 1965.\n\n[11] A. \u00c7ivril and M. Magdon-Ismail. Column subset selection via sparse approximation of SVD. Theoretical Computer Science, 421:1\u201314, March 2012.\n\n[12] A. M. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In IEEE Symposium on Foundations of Computer Science, pages 370\u2013378, 1998.\n\n[13] A. M. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM, 51(6):1025\u20131041, 2004.\n\n[14] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(12):225\u2013247, 2006.\n\n[15] A. Deshpande and L. Rademacher. Efficient volume sampling for row/column subset selection. In FOCS, pages 329\u2013338. IEEE Computer Society Press, 2010.\n\n[16] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697\u2013702, 2009.\n\n[17] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3441\u20133472, 2012.\n\n[18] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. arXiv e-print (arXiv:1207.6365v4), April 2013.\n\n[19] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288, 2011.\n\n[20] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to algorithms. MIT Press and McGraw-Hill Book Company, third edition, 2009.", "award": [], "sourceid": 800, "authors": [{"given_name": "Crystal", "family_name": "Maung", "institution": "UT Dallas"}, {"given_name": "Haim", "family_name": "Schweitzer", "institution": "UT Dallas"}]}