{"title": "\u03a3-Optimality for Active Learning on Gaussian Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 2751, "page_last": 2759, "abstract": "A common classifier for unlabeled nodes on undirected graphs uses label propagation from the labeled nodes, equivalent to the harmonic predictor on Gaussian random fields (GRFs). For active learning on GRFs, the commonly used V-optimality criterion queries nodes that reduce the L2 (regression) loss. V-optimality satisfies a submodularity property showing that greedy reduction produces a (1 \u2212 1/e) globally optimal solution. However, L2 loss may not characterise the true nature of 0/1 loss in classification problems and thus may not be the best choice for active learning. We consider a new criterion we call \u03a3-optimality, which queries the node that minimizes the sum of the elements in the predictive covariance. \u03a3-optimality directly optimizes the risk of the surveying problem, which is to determine the proportion of nodes belonging to one class. In this paper we extend submodularity guarantees from V-optimality to \u03a3-optimality using properties specific to GRFs. We further show that GRFs satisfy the suppressor-free condition in addition to the conditional independence inherited from Markov random fields. 
We test \u03a3-optimality on real-world graphs with both synthetic and real data and show that it outperforms V-optimality and other related methods on classification.", "full_text": "\u03a3-Optimality for Active Learning on Gaussian\n\nRandom Fields\n\nYifei Ma\n\nMachine Learning Department\nCarnegie Mellon University\nyifeim@cs.cmu.edu\n\nRoman Garnett\n\nrgarnett@uni-bonn.de\n\nComputer Science Department\n\nUniversity of Bonn\n\nJeff Schneider\nRobotics Institute\n\nCarnegie Mellon University\nschneide@cs.cmu.edu\n\nAbstract\n\nA common classi\ufb01er for unlabeled nodes on undirected graphs uses label propaga-\ntion from the labeled nodes, equivalent to the harmonic predictor on Gaussian ran-\ndom \ufb01elds (GRFs). For active learning on GRFs, the commonly used V-optimality\ncriterion queries nodes that reduce the L2 (regression) loss. V-optimality satis-\n\ufb01es a submodularity property showing that greedy reduction produces a (1\u2212 1/e)\nglobally optimal solution. However, L2 loss may not characterise the true nature\nof 0/1 loss in classi\ufb01cation problems and thus may not be the best choice for active\nlearning.\nWe consider a new criterion we call \u03a3-optimality, which queries the node that\nminimizes the sum of the elements in the predictive covariance. \u03a3-optimality\ndirectly optimizes the risk of the surveying problem, which is to determine the\nproportion of nodes belonging to one class. In this paper we extend submodularity\nguarantees from V-optimality to \u03a3-optimality using properties speci\ufb01c to GRFs.\nWe further show that GRFs satisfy the suppressor-free condition in addition to\nthe conditional independence inherited from Markov random \ufb01elds. 
We test \u03a3-\noptimality on real-world graphs with both synthetic and real data and show that it\noutperforms V-optimality and other related methods on classi\ufb01cation.\n\n1\n\nIntroduction\n\nReal-world data are often presented as a graph where the nodes in the graph bear labels that vary\nsmoothly along edges. For example, for scienti\ufb01c publications, the content of one paper is highly\ncorrelated with the content of papers that it references or is referenced by, the \ufb01eld of interest of a\nscholar is highly correlated with other scholars s/he coauthors with, etc. Many of these networks\ncan be described using an undirected graph with nonnegative edge weights set to be the strengths of\nthe connections between nodes.\nThe model for label prediction in this paper is the harmonic function on the Gaussian random \ufb01eld\n(GRF) by Zhu et al. (2003). It can generalize two popular and intuitive algorithms: label propagation\n(Zhu & Ghahramani, 2002), and random walk with absorptions (Wu et al., 2012). GRFs can be\nseen as a Gaussian process (GP) (Rasmussen & Williams, 2006) with its (maybe improper) prior\ncovariance matrix whose (pseudo)inverse is set to be the graph Laplacian.\nLike other learning problems, labels may be insuf\ufb01cient and expensive to gather, especially if one\nwants to discover a new phenomenon on a graph. Active learning addresses these issues by making\nautomated decisions on which nodes to query for labels from experts or the crowd. Some popular\ncriteria are empirical risk minimization (Settles, 2010; Zhu et al., 2003), mutual information gain\n(Krause et al., 2008), and V-optimality (Ji & Han, 2012). Here we consider an alternative criterion,\n\u03a3-optimality, and establish several related theoretical results. Namely, we show that greedy reduc-\ntion of \u03a3-optimality provides a (1\u2212 1/e) approximation bound to the global optimum. 
We also show that Gaussian random fields satisfy the suppressor-free condition, described below. Finally, we show that \u03a3-optimality outperforms other approaches for active learning with GRFs for classification.\n\n1.1 V-optimality on Gaussian Random Fields\n\nJi & Han (2012) proposed greedy variance minimization as a cheap and high-profile surrogate criterion for active classification. To decide which node to query next, the active learning algorithm finds the unlabeled node which leads to the smallest average predictive variance on all other unlabeled nodes. It corresponds to standard V-optimality in optimal experiment design.\nWe will discuss several aspects of V-optimality on GRFs below: 1. The motivation behind V-optimality can be paraphrased as expected risk minimization with the L2-surrogate loss (Section 2.1). 2. The greedy solution to the set optimization problem in V-optimality is comparable to the global solution up to a constant (Theorem 1). 3. The greedy application of V-optimality can also be interpreted as a heuristic which selects nodes that have high correlation to nodes with high variances (Observation 4).\nSome previous work is related to point 2 above. Nemhauser et al. (1978) show that any submodular, monotone, and normalized set function yields a (1 \u2212 1/e) global optimality guarantee for greedy solutions. Our proof techniques coincide with Friedland & Gaubert (2011) in principle, but we are not restricted to spectral functions. Krause et al. (2008) showed a counterexample where the V-optimality objective function with GP models does not satisfy submodularity.\n\n1.2 \u03a3-optimality on Gaussian Random Fields\n\nWe define \u03a3-optimality on GRFs to be another variance minimization criterion that minimizes the sum of all entries in the predictive covariance matrix. 
As we will show in Lemma 7, the predictive covariance matrix is nonnegative entry-wise, and thus the definition is proper. \u03a3-optimality was originally proposed by Garnett et al. (2012) in the context of active surveying, which is to determine the proportion of nodes belonging to one class. However, we focus on its performance as a criterion in active classification heuristics. The survey-risk of \u03a3-optimality replaces the L2-risk of V-optimality as an alternative surrogate risk for the 0/1-risk.\nWe also prove that the greedy application of \u03a3-optimality has a theoretical bound similar to V-optimality's. We will show that greedily minimizing \u03a3-optimality empirically outperforms greedily minimizing V-optimality on classification problems. The exact reason for the superiority of \u03a3-optimality as a surrogate loss in the GRF model is still an open question, but we observe that \u03a3-optimality tends to select cluster centers whereas V-optimality goes after outliers (Section 3.1).\nFinally, greedy application of both \u03a3-optimality and V-optimality needs O(N) time per query-candidate evaluation after a one-time inversion of an N \u00d7 N matrix.\n\n1.3 GRFs Are Suppressor Free\n\nIn linear regression, an explanatory variable is called a suppressor if adding it as a new variable enhances correlations between the old variables and the dependent variable (Walker, 2003; Das & Kempe, 2008). Suppressors are persistent in real-world data. We show GRFs to be suppressor-free. Intuitively, this means that with more labels acquired, the conditional correlation between unlabeled nodes decreases even when their Markov blanket has not formed. 
That GRFs present natural examples for the otherwise obscure suppressor-free condition is interesting.\n\n2 Learning Model & Active Learning Objectives\n\nWe use Gaussian random field/label propagation (GRF/LP) as our learning model. Suppose the dataset can be represented in the form of a connected undirected graph G = (V, E) where each node has an (either known or unknown) label and each edge $e_{ij}$ has a fixed nonnegative weight $w_{ij} (= w_{ji})$ that reflects the proximity, similarity, etc. between nodes $v_i$ and $v_j$. Define the graph Laplacian of G to be $L = \\operatorname{diag}(W\\mathbf{1}) - W$, i.e., $l_{ii} = \\sum_j w_{ij}$ and $l_{ij} = -w_{ij}$ when $i \\neq j$. Let $L_\\delta = L + \\delta I$ be the generalized Laplacian obtained by adding self-loops. In the following, we will write $L$ to also encompass $\\beta L_\\delta$ for the set of hyper-parameters $\\beta > 0$ and $\\delta \\geq 0$.\nThe binary GRF is a Bayesian model that generates $y_i \\in \\{0, +1\\}$ for every node $v_i$ according to\n$$p(y) \\propto \\exp\\Big\\{ -\\frac{\\beta}{2} \\Big( \\sum_{i,j} w_{ij} (y_i - y_j)^2 + \\delta \\sum_i y_i^2 \\Big) \\Big\\} = \\exp\\Big( -\\frac{1}{2} y^T L y \\Big). \\quad (2.1)$$\nSuppose nodes $\\ell = \\{v_{\\ell_1}, \\ldots, v_{\\ell_{|\\ell|}}\\}$ are labeled as $y_\\ell = (y_{\\ell_1}, \\ldots, y_{\\ell_{|\\ell|}})^T$; a GRF infers the output distribution on unlabeled nodes, $y_u = (y_{u_1}, \\ldots, y_{u_{|u|}})^T$, by the conditional distribution given $y_\\ell$, as\n$$\\Pr(y_u \\mid y_\\ell) \\propto \\mathcal{N}(\\hat{y}_u, L_u^{-1}) = \\mathcal{N}(\\hat{y}_u, L_{(v-\\ell)}^{-1}), \\quad (2.2)$$\nwhere $\\hat{y}_u = -L_u^{-1} L_{u\\ell} y_\\ell$ is the vector of predictive means on unlabeled nodes and $L_u$ is the principal submatrix consisting of the unlabeled row and column indices in $L$, that is, the lower-right block of $L = \\begin{pmatrix} L_\\ell & L_{\\ell u} \\\\ L_{u\\ell} & L_u \\end{pmatrix}$. By convention, $L_{(v-\\ell)}^{-1}$ means the inverse of the principal submatrix. We use $L_{(v-\\ell)}$ and $L_u$ interchangeably because $\\ell$ and $u$ partition the set of all nodes $v$.\nFinally, GRF, or GRF/LP, is a relaxation of the binary GRF to continuous outputs, because the latter is computationally intractable even for a-priori generation. LP stands for label propagation, because the predictive mean on a node is the probability that a random walk leaving that node hits a positive label before hitting a zero label. For multi-class problems, Zhu et al. (2003) proposed the harmonic predictor, which looks at predictive means in one-versus-all comparisons.\nRemark: An alternative approximation to the binary GRF is the GRF-sigmoid model, which draws the binary outputs from Bernoulli distributions with means set to be the sigmoid function of the GRF (latent) variables. However, this alternative is very slow to compute and may not be compatible with the theoretical results in this paper.\n\n2.1 Active Learning Objective 1: L2 Risk Minimization (V-Optimality)\n\nSince in GRFs regression responses are taken directly as probability predictions, it is computationally and analytically more convenient to apply the regression loss directly in the GRF, as in Ji & Han (2012). Assume the L2 loss to be our classification loss. 
The risk function, whose input variable is the labeled subset $\\ell$, is\n$$R_V(\\ell) = \\mathbb{E}_{y_\\ell y_u} \\sum_{u_i \\in u} (y_{u_i} - \\hat{y}_{u_i})^2 = \\mathbb{E}\\bigg[ \\mathbb{E}\\bigg[ \\sum_{u_i \\in u} (y_{u_i} - \\hat{y}_{u_i})^2 \\,\\bigg|\\, y_\\ell \\bigg] \\bigg] = \\operatorname{tr}(L_u^{-1}). \\quad (2.3)$$\nThis risk is written with a subscript V because minimizing (2.3) is also the V-optimality criterion, which minimizes mean prediction variance in active learning.\nIn active learning, we strive to select a subset $\\ell$ of nodes to query for labels, constrained by a given budget $C$, such that the risk is minimized. Formally,\n$$\\operatorname*{arg\\,min}_{\\ell:\\, |\\ell| \\leq C} R(\\ell) = R_V(\\ell) = \\operatorname{tr}(L_{(v-\\ell)}^{-1}). \\quad (2.4)$$\n\n2.2 Active Learning Objective 2: Survey Risk Minimization (\u03a3-Optimality)\n\nAnother objective building on the GRF model (2.2) is to determine the proportion of nodes belonging to class 1, as would happen when performing a survey. For active surveying, the risk would be\n$$R_\\Sigma(\\ell) = \\mathbb{E}_{y_\\ell y_u} \\Big( \\sum_{u_i \\in u} y_{u_i} - \\sum_{u_i \\in u} \\hat{y}_{u_i} \\Big)^2 = \\mathbb{E}\\big[ \\mathbb{E}\\big[ (\\mathbf{1}^T y_u - \\mathbf{1}^T \\hat{y}_u)^2 \\mid y_\\ell \\big] \\big] = \\mathbf{1}^T L_u^{-1} \\mathbf{1}, \\quad (2.5)$$\nwhich could substitute for the risk $R(\\ell)$ in (2.4) and yield another heuristic for selecting nodes in batch active learning. We will refer to this modified optimization objective as the \u03a3-optimality heuristic:\n$$\\operatorname*{arg\\,min}_{\\ell:\\, |\\ell| \\leq C} R(\\ell) = R_\\Sigma(\\ell) = \\mathbf{1}^T L_{(v-\\ell)}^{-1} \\mathbf{1}. \\quad (2.6)$$\nFurther, we will also consider the application of \u03a3-optimality in active classification because (2.6) is another metric of the predictive variance. 
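To make the two objectives concrete, here is a minimal NumPy sketch (our illustration, not code from the paper) that computes both risks for a given labeled set; the function name grf_risks and the toy path graph are hypothetical:

```python
import numpy as np

def grf_risks(L, labeled):
    """Return (R_V, R_Sigma) for a GRF with generalized Laplacian L,
    after querying the nodes in `labeled` (eqs. 2.3 and 2.5)."""
    n = L.shape[0]
    u = [i for i in range(n) if i not in set(labeled)]
    C = np.linalg.inv(L[np.ix_(u, u)])  # predictive covariance on unlabeled nodes
    return np.trace(C), C.sum()         # tr(L_u^{-1}) and 1^T L_u^{-1} 1

# Path graph on 4 nodes, unit weights, regularized with delta = 0.05
W = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.diag(W.sum(axis=1)) - W + 0.05 * np.eye(4)
r_v, r_sigma = grf_risks(L, labeled=[0])
```

Because the predictive covariance of a GRF is entry-wise nonnegative (Lemma 7), the sum of all entries is never smaller than the trace, and both risks can only decrease as more nodes are queried (Lemma 2).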
Surprisingly, although both (2.3) and (2.5) are approximations of the real objective (the 0/1 risk), greedy reduction of the \u03a3-optimality criterion outperforms greedy reduction of the V-optimality criterion in active classification (Sections 3.1 and 5.1), as well as several other methods, including expected error reduction.\n\n2.3 Greedy Sequential Application of V/\u03a3-Optimality\n\nBoth (2.4) and (2.6) are subset optimization problems. Calculating the global optimum may be intractable. As will be shown later in the theoretical results, the reductions of both risks are submodular set functions, and the greedy sequential update algorithm yields a solution that has a guaranteed approximation ratio to the optimum (Theorem 1).\nAt the k-th query decision, denote the covariance matrix conditioned on the previous $(k-1)$ queries as $C = (L_{(v-\\ell^{(k-1)})})^{-1}$. By Schur's Lemma (or the GP-regression update rule), the one-step look-ahead covariance matrix conditioned on $\\ell^{(k-1)} \\cup \\{v\\}$, denoted as $C' = (L_{(v-(\\ell^{(k-1)} \\cup \\{v\\}))})^{-1}$, has the following update formula:\n$$\\begin{pmatrix} C' & 0 \\\\ 0 & 0 \\end{pmatrix} = C - \\frac{1}{C_{vv}} \\, C_{:v} C_{v:}, \\quad (2.7)$$\nwhere without loss of generality $v$ was positioned as the last node. 
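As a quick numerical illustration of the update rule (2.7) (our own sketch, not from the paper; the example graph is arbitrary), the one-step look-ahead covariance obtained by the rank-one update matches a direct inversion of the shrunken principal submatrix:

```python
import numpy as np

# A small weighted graph; delta = 0.05 self-loops keep L invertible
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 0.],
              [1., 1., 0., 3.],
              [0., 0., 3., 0.]])
L = np.diag(W.sum(axis=1)) - W + 0.05 * np.eye(4)

C = np.linalg.inv(L)                  # covariance before the k-th query
v = 2                                 # candidate node to condition on

# Rank-one update of eq. (2.7): also zeroes out row/column v of C
C_next = C - np.outer(C[:, v], C[v, :]) / C[v, v]

# Direct route: invert the principal submatrix with v removed
rest = [i for i in range(4) if i != v]
C_direct = np.linalg.inv(L[np.ix_(rest, rest)])
```

Restricted to the remaining nodes, C_next coincides with C_direct; this is the standard Gaussian conditioning (Schur complement) identity the paper invokes.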
Further denoting $C_{ij} = \\rho_{ij} \\sigma_i \\sigma_j$, we can put (2.7) inside $R_\\Sigma(\\cdot)$ and $R_V(\\cdot)$ to get the following equivalent criteria:\n$$\\text{V-optimality:}\\quad v^{(k)*} = \\operatorname*{arg\\,max}_{v \\in u} \\frac{\\sum_{t \\in u} (C_{vt})^2}{C_{vv}} = \\sum_{t \\in u} \\rho_{vt}^2 \\sigma_t^2, \\quad (2.8)$$\n$$\\text{\u03a3-optimality:}\\quad v^{(k)*} = \\operatorname*{arg\\,max}_{v \\in u} \\frac{\\big(\\sum_{t \\in u} C_{vt}\\big)^2}{C_{vv}} = \\Big( \\sum_{t \\in u} \\rho_{vt} \\sigma_t \\Big)^2. \\quad (2.9)$$\n\n3 Theoretical Results & Insights\n\nFor the general GP model, greedy optimization of the L2 risk has no guarantee that the solution is comparable to the brute-force global optimum (taking exponential time to compute), because the objective function, the trace of the predictive covariance matrix, fails to satisfy submodularity in all cases (Krause et al., 2008). However, in the special case of GPs with kernel matrix equal to the inverse of a graph Laplacian (with $\\ell \\neq \\emptyset$ or $\\delta > 0$), the GRF does provide such theoretical guarantees, both for V-optimality and \u03a3-optimality. The latter is a novel result.\nThe following theoretical results concern greedy maximization of the risk reduction function (which is shown to be submodular): $R_\\Delta(\\ell) = R(\\emptyset) - R(\\ell)$ for either $R(\\cdot) = R_V(\\cdot)$ or $R_\\Sigma(\\cdot)$.\nTheorem 1 (Near-optimal guarantee for greedy applications of V/\u03a3-optimality). In risk reduction,\n$$R_\\Delta(\\ell^g) \\geq (1 - 1/e) \\cdot R_\\Delta(\\ell^*), \\quad (3.1)$$\nwhere $R_\\Delta(\\ell) = R(\\emptyset) - R(\\ell)$ for either $R(\\cdot) = R_V(\\cdot)$ or $R_\\Sigma(\\cdot)$, $e$ is Euler's number, $\\ell^g$ is the greedy optimizer, and $\\ell^*$ is the true global optimizer under the constraint $|\\ell^*| \\leq |\\ell^g|$.\nAccording to Nemhauser et al. (1978), it suffices to show the following properties of $R_\\Delta(\\ell)$:\nLemma 2 (Normalization, Monotonicity, and Submodularity). $\\forall \\ell_1 \\subset \\ell_2 \\subset v$ and $\\forall v \\in v$,\n$$R_\\Delta(\\emptyset) = 0, \\quad (3.2)$$\n$$R_\\Delta(\\ell_2) \\geq R_\\Delta(\\ell_1), \\quad (3.3)$$\n$$R_\\Delta(\\ell_1 \\cup \\{v\\}) - R_\\Delta(\\ell_1) \\geq R_\\Delta(\\ell_2 \\cup \\{v\\}) - R_\\Delta(\\ell_2). \\quad (3.4)$$\nAnother sufficient condition for Theorem 1, which is itself an interesting observation, is the suppressor-free condition. Walker (2003) describes a suppressor as a variable, knowing which will suddenly create a strong correlation between the predictors. An example is $y_i + y_j = y_k$: knowing any one of these will create correlations between the others. Walker further states that suppressors are common in regression problems. Das & Kempe (2008) extend the suppressor-free condition to sets and showed that this condition is sufficient to prove (3.1). Formally, the condition is:\n$$\\big| \\operatorname{corr}(y_i, y_j \\mid \\ell_1 \\cup \\ell_2) \\big| \\leq \\big| \\operatorname{corr}(y_i, y_j \\mid \\ell_1) \\big|, \\quad \\forall v_i, v_j \\in v,\\ \\forall \\ell_1, \\ell_2 \\subset v. \\quad (3.5)$$\nIt may be easier to understand (3.5) as a decreasing-correlation property. It is well known for Markov random fields that the labels of two nodes on a graph become independent given the labels of their Markov blanket. Here we establish that the GRF boasts more than that: the correlation between any two nodes decreases as more nodes get labeled, even before a Markov blanket is formed. Formally:\nTheorem 3 (Suppressor-Free Condition). (3.5) holds for pairs of nodes in the GRF model. 
Note that since the conditional covariance of the GRF model is $L_{(v-\\ell)}^{-1}$, we can properly define the corresponding conditional correlation to be\n$$\\operatorname{corr}(y_u \\mid \\ell) = D^{-\\frac{1}{2}} L_{(v-\\ell)}^{-1} D^{-\\frac{1}{2}}, \\quad \\text{with } D = \\operatorname{diag}\\big( L_{(v-\\ell)}^{-1} \\big). \\quad (3.6)$$\n\n3.1 Insights From Comparing the Greedy Applications of the \u03a3/V-Optimality Criteria\n\nBoth V- and \u03a3-optimality are approximations to the 0/1 risk minimization objective. Unfortunately, we cannot theoretically reason why greedy \u03a3-optimality outperforms V-optimality in the experiments. However, we made two observations during our investigation that provide some insights. An illustrative toy example is also provided in Section 5.1.\nObservation 4. Eqs. (2.8) and (2.9) suggest that both greedy \u03a3/V-optimality select nodes that (1) have high variance and (2) are highly correlated to high-variance nodes, conditioned on the labeled nodes. Notice that Lemma 7 proves that predictive correlations are always nonnegative.\nIn order to contrast \u03a3/V-optimality, rewrite (2.9) as:\n$$(\\text{\u03a3-optimality}):\\quad \\operatorname*{arg\\,max}_{v \\in u} \\Big( \\sum_{t \\in u} \\rho_{vt} \\sigma_t \\Big)^2 = \\sum_{t \\in u} \\rho_{vt}^2 \\sigma_t^2 + \\sum_{t_1 \\neq t_2 \\in u} \\rho_{vt_1} \\rho_{vt_2} \\sigma_{t_1} \\sigma_{t_2}. \\quad (3.7)$$\nObservation 5. \u03a3-optimality has one more term that involves cross products of $(\\rho_{vt_1} \\sigma_{t_1})$ and $(\\rho_{vt_2} \\sigma_{t_2})$ (which are nonnegative according to Lemma 9). By the Cauchy\u2013Schwarz inequality, the sum of these cross products is maximized when they are equal. So, \u03a3-optimality additionally favors nodes that (3) have consistent global influence, i.e., that are more likely to be cluster centers.\n\n4 Proof Sketches\n\nOur results are predicated on, and extend to, GPs whose inverse covariance matrix meets Proposition 6.\nProposition 6. L satisfies the following. 
1\np6.1 (L has proper signs): $l_{ij} \\geq 0$ if $i = j$ and $l_{ij} \\leq 0$ if $i \\neq j$.\np6.2 (L is undirected and connected): $l_{ij} = l_{ji}\\ \\forall i, j$, and $\\sum_{j \\neq i} (-l_{ij}) > 0$.\np6.3 (node degree no less than the total weight of incident edges): $l_{ii} \\geq \\sum_{j \\neq i} (-l_{ij}) = \\sum_{j \\neq i} (-l_{ji}) > 0,\\ \\forall i$.\np6.4 (L is nonsingular and positive-definite): $\\exists i: l_{ii} > \\sum_{j \\neq i} (-l_{ij}) = \\sum_{j \\neq i} (-l_{ji}) > 0$.\nAlthough the properties of V-optimality fall into the more general class of spectral functions (Friedland & Gaubert, 2011), we have seen no proof of either the suppressor-free condition or the submodularity of \u03a3-optimality on GRFs. We write out the ideas behind the proofs; details are in the appendix.2\nLemma 7. For any L satisfying (p6.1-4), $L^{-1} \\geq 0$ entry-wise.3\n\nProof. Sketch: Suppose $L = D - W = D(I - D^{-1}W)$, with $D = \\operatorname{diag}(L)$. Then we can show the convergence of the Taylor expansion (Appendix A.1):\n$$L^{-1} = \\Big[ I + \\sum_{r=1}^{\\infty} (D^{-1}W)^r \\Big] D^{-1}. \\quad (4.1)$$\nIt suffices to observe that every term on the right-hand side (RHS) is nonnegative.\nCorollary 8. The GRF prediction operator $-L_u^{-1} L_{u\\ell}$ maps $y_\\ell \\in [0, 1]^{|\\ell|}$ to $\\hat{y}_u = -L_u^{-1} L_{u\\ell} y_\\ell \\in [0, 1]^{|u|}$. When L is singular, the mapping is onto.\n\n1 Property p6.4 holds after the first query is done or when the regularizer \u03b4 > 0 in (2.1).\n2 Available at http://www.autonlab.org/autonweb/21763.html\n3 In the following, for any vector or matrix A, A \u2265 0 always stands for A being (entry-wise) nonnegative.\n\nProof. For $y_\\ell = \\mathbf{1}$, $(L_u, L_{u\\ell}) \\cdot \\mathbf{1} \\geq 0$ and $L_u^{-1} \\geq 0$ imply $(I, L_u^{-1} L_{u\\ell}) \\cdot \\mathbf{1} \\geq 0$, i.e. $\\mathbf{1} \\geq -L_u^{-1} L_{u\\ell} \\mathbf{1} = \\hat{y}_u$. As both $L_u^{-1} \\geq 0$ and $-L_{u\\ell} \\geq 0$, we have $y_\\ell \\geq 0 \\Rightarrow \\hat{y}_u \\geq 0$ and $y_\\ell \\geq y'_\\ell \\Rightarrow \\hat{y}_u \\geq \\hat{y}'_u$.\nLemma 9. Suppose $L = \\begin{pmatrix} L_{11} & L_{12} \\\\ L_{21} & L_{22} \\end{pmatrix}$. Then $L^{-1} - \\begin{pmatrix} L_{11}^{-1} & 0 \\\\ 0 & 0 \\end{pmatrix} \\geq 0$ and is positive-semidefinite.\n\nProof. As $L^{-1} \\geq 0$ and is PSD, the RHS below is term-wise nonnegative and the middle term PSD (Appendix A.2):\n$$L^{-1} - \\begin{pmatrix} L_{11}^{-1} & 0 \\\\ 0 & 0 \\end{pmatrix} = \\begin{pmatrix} L_{11}^{-1} (-L_{12}) \\\\ I \\end{pmatrix} \\big( L_{22} - L_{21} L_{11}^{-1} L_{12} \\big)^{-1} \\big( (-L_{21}) L_{11}^{-1},\\ I \\big).$$\nAs a corollary, the monotonicity in (3.3) for both $R(\\cdot) = R_V(\\cdot)$ and $R_\\Sigma(\\cdot)$ can be shown.\nBoth the proof of submodularity in (3.4) and that of Theorem 3 result from more careful execution of matrix inversions similar to Lemma 9 (detailed in Appendix A.4). We sketch Theorem 3 as an example.\n\nProof. Without loss of generality, let $u = v - \\ell = \\{1, \\ldots, k\\}$. By Schur's Lemma (Appendix A.3):\n$$L_{(v-\\ell)} := \\begin{pmatrix} A_u & b_u \\\\ b_u^T & c_u \\end{pmatrix} \\;\\Rightarrow\\; \\frac{\\operatorname{Cov}(y_i, y_k \\mid \\ell)}{\\operatorname{Var}(y_k \\mid \\ell)} = \\frac{(L_{(v-\\ell)}^{-1})_{ik}}{(L_{(v-\\ell)}^{-1})_{kk}} = \\big( A_u^{-1} (-b_u) \\big)_i, \\ \\forall i \\neq k, \\quad (4.2)$$\nwhere the LHS is a reparametrization with $c_u$ being a scalar. Lemma 9 shows that $u_1 \\supset u_2 \\Rightarrow A_{u_1}^{-1} \\geq A_{u_2}^{-1}$ at corresponding entries. Also notice that $-b_{u_1} \\geq -b_{u_2}$ at corresponding entries, and so the RHS of (4.2) is larger with $u_1$. 
It suffices to draw a similar inequality in the other direction, for $\\operatorname{Cov}(y_k, y_i \\mid \\ell) / \\operatorname{Var}(y_i \\mid \\ell)$.\n\n5 A Toy Example and Some Simulations\n\n5.1 Comparing V-Optimality and \u03a3-Optimality: Active Node Classification on a Graph\n\nTo visualize the intuitions described in Section 3.1, Figure 1 shows the first few nodes selected by different optimality criteria. This graph is constructed by a breadth-first search from a random node in a larger DBLP coauthorship network graph that we will introduce in the next section. On this toy graph, both criteria pick the same center node to query first. However, for the second and third queries, V-optimality weighs the uncertainty of the candidate node more, choosing outliers, whereas \u03a3-optimality favors nodes with universal influence over the graph and goes to cluster centers.\n\nFigure 1: Toy graph demonstrating the behavior of \u03a3-optimality vs. V-optimality.\n\n5.2 Simulating Labels on a Graph\n\nTo further investigate the behavior of \u03a3- and V-optimality, we conducted experiments on synthetic labels generated on real-world network graphs. The node labels were first simulated using the model in order to compare the active learning criteria directly, without raising questions of model fit. We carry out tests on the same graphs with real data in the next section.\nWe simulated the binary labels with the GRF-sigmoid model and performed active learning with the GRF/LP model for predictions. The parameters in the generation phase were \u03b2 = 0.01 and \u03b4 = 0.05, which maximizes the average classification accuracy increase from 50 random training nodes to 200 random training nodes using the GRF/LP model for predictions. 
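The greedy V/\u03a3-optimality queries used in these simulations can be sketched as follows (a minimal NumPy implementation of criteria (2.8)-(2.9) with the update (2.7); our own sketch with hypothetical names, not the authors' released code):

```python
import numpy as np

def greedy_queries(L, budget, criterion="sigma"):
    """Greedily pick `budget` nodes by Sigma- or V-optimality.
    Conditioning on v subtracts C[:, v] C[v, :] / C[v, v] from the
    covariance, as in eq. (2.7)."""
    C = np.linalg.inv(L)                 # prior covariance (assumes delta > 0)
    unlabeled = list(range(L.shape[0]))
    picks = []
    for _ in range(budget):
        if criterion == "sigma":         # eq. (2.9): (sum_t C_vt)^2 / C_vv
            scores = [C[v, unlabeled].sum() ** 2 / C[v, v] for v in unlabeled]
        else:                            # eq. (2.8): sum_t C_vt^2 / C_vv
            scores = [(C[v, unlabeled] ** 2).sum() / C[v, v] for v in unlabeled]
        v = unlabeled[int(np.argmax(scores))]
        C = C - np.outer(C[:, v], C[v, :]) / C[v, v]   # one-step update (2.7)
        unlabeled.remove(v)
        picks.append(v)
    return picks

# Example: 3 greedy Sigma-optimality queries on a small star graph
W = np.zeros((5, 5)); W[0, 1:] = 1.0; W[1:, 0] = 1.0
L = np.diag(W.sum(axis=1)) - W + 0.1 * np.eye(5)
picks = greedy_queries(L, 3, criterion="sigma")
```

Each score is exactly the one-step reduction of the corresponding risk, so the first pick agrees with a brute-force one-step look-ahead over all candidates; after the one-time inversion, each candidate evaluation is O(N), as noted in Section 1.2.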
Figure 2 shows the binary classification accuracy versus the number of queries on both the DBLP coauthorship graph and the CORA citation graph that we will describe below. The best possible classification results are indicated by the leave-one-out (LOO) accuracies given under each plot.\n\nFigure 2: Simulating binary labels by the GRF-sigmoid; learning with the GRF/LP, 480 repetitions. (a) DBLP coauthorship, 68.3% LOO accuracy. (b) CORA citation, 60.5% LOO accuracy.\n\nFigure 2 can be a surprise given the reasoning behind the L2 surrogate loss, especially when the predictive means are trapped between [\u22121, 1], but we see here that our reasoning in Sections 3.1 and 5.1 can lead to the greedy survey loss actually making a better active learning objective.\nWe have also performed experiments with different values of \u03b2 and \u03b4. Despite the fact that larger \u03b2 and \u03b4 increase label independence from the graph structure and undermine the effectiveness of both V/\u03a3-optimality heuristics, we have seen that whenever V-optimality establishes a superiority over random selection, \u03a3-optimality yields better performance.\n\n6 Real-World Experiments\n\nThe active learning heuristics to be compared are:4\n\n1. The new \u03a3-optimality with greedy sequential updates: $\\min_{v'} \\mathbf{1}^T (L_{u_k \\setminus \\{v'\\}})^{-1} \\mathbf{1}$.\n2. Greedy V-optimality (Ji & Han, 2012): $\\min_{v'} \\operatorname{tr}\\big( (L_{u_k \\setminus \\{v'\\}})^{-1} \\big)$.\n3. Mutual information gain (MIG) (Krause et al., 2008): $\\max_{v'} (L_{u_k}^{-1})_{v',v'} \\big/ \\big( (L_{\\ell_k \\cup \\{v'\\}})^{-1} \\big)_{v',v'}$.\n4. Uncertainty sampling (US) picking the largest prediction margin: $\\max_{v'} \\hat{y}^{(1)}_{v'} - \\hat{y}^{(2)}_{v'}$.\n5. Expected error reduction (EER) (Settles, 2010; Zhu et al., 2003). Selected nodes maximize the average prediction confidence in expectation: $\\max_{v'} \\mathbb{E}_{y_{v'}} \\big[ \\big( \\sum_{u_i \\in u} \\hat{y}^{(1)}_{u_i} \\big) \\,\\big|\\, y_{\\ell_k}, y_{v'} \\big]$.\n6. Random selection, with 12 repetitions.\n\nComparisons are made on three real-world network graphs.\n\n1. DBLP coauthorship network.5 The nodes represent scholars and the weighted edges are the number of papers bearing both scholars' names. The largest connected component has 1711 nodes and 2898 edges. The node labels were hand-assigned in Ji & Han (2012) to one of the four expertise areas of the scholars: machine learning, data mining, information retrieval, and databases. Each class has around 400 nodes.\n2. Cora citation network.6 This is a citation graph of 2708 publications, each of which is classified into one of seven classes: case based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. The network has 5429 links. We took its largest connected component, with 2485 nodes and 5069 undirected and unweighted edges.\n3. CiteSeer citation network.6 This is another citation graph of 3312 publications, each of which is classified into one of six classes: agents, artificial intelligence, databases, information retrieval, machine learning, human computer interaction. The network has 4732 links. We took its largest connected component, with 2109 nodes and 3665 undirected and unweighted edges.\n\n4 Code available at http://www.autonlab.org/autonweb/21763\n5 http://www.informatik.uni-trier.de/~ley/db/\n6 http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html\n\nFigure 3: Classification accuracy vs. the number of queries. \u03b2 = 1, \u03b4 = 0. Randomized first query. (a) DBLP, 84% LOO accuracy. (b) CORA, 86% LOO accuracy. (c) CITESEER, 76% LOO accuracy.\n\nOn all three datasets, \u03a3-optimality outperforms the other methods by a large margin, especially during the first five to ten queries. The runner-up, EER, catches up to \u03a3-optimality in some cases, but EER does not have theoretical guarantees.\nThe win of \u03a3-optimality over V-optimality has been intuitively explained in Section 5.1 as \u03a3-optimality having better exploration ability and robustness against outliers. The node choices by both criteria were also visually inspected after embedding the graph in 2-dimensional space using the OpenOrd method developed by Martin et al. (2011). The analysis there was similar to Figure 1.\nWe also performed real-world experiments on the root-mean-square error of the class proportion estimations, which is the survey risk that \u03a3-optimality minimizes. \u03a3-optimality beats V-optimality. Details are omitted for space concerns.\n\n7 Conclusion\n\nFor active learning on GRFs, it is common to use variance minimization criteria with greedy one-step lookahead heuristics. V-optimality and \u03a3-optimality are two criteria based on statistics of the predictive covariance matrix. They are both also risk minimization criteria: V-optimality minimizes the L2 risk (2.3), whereas \u03a3-optimality minimizes the survey risk (2.5).\nActive learning with both criteria can be seen as subset optimization problems (2.4), (2.6). Both objective functions are supermodular set functions. Therefore, risk reduction is submodular and the greedy one-step lookahead heuristics can achieve a (1 \u2212 1/e) global optimality ratio. 
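These set-function properties are easy to spot-check numerically; the following is our own sketch (not from the paper), verifying normalization, monotonicity, and submodularity of the risk reduction for both risks on a small dense graph with \u03b4 > 0:

```python
import numpy as np
from itertools import combinations

def risk(L, queried, kind="sigma"):
    """R_V or R_Sigma (eqs. 2.3 / 2.5) after querying `queried`."""
    u = [i for i in range(L.shape[0]) if i not in queried]
    C = np.linalg.inv(L[np.ix_(u, u)])
    return np.trace(C) if kind == "v" else C.sum()

def check_risk_reduction(L, kind="sigma", tol=1e-9):
    """Check (3.2)-(3.4) for R_delta(l) = R(empty) - R(l) on small subsets."""
    n = L.shape[0]
    rd = lambda s: risk(L, set(), kind) - risk(L, s, kind)
    assert abs(rd(set())) < tol                      # normalization (3.2)
    for s1 in map(set, combinations(range(n), 1)):
        for j in set(range(n)) - s1:
            s2 = s1 | {j}
            assert rd(s2) >= rd(s1) - tol            # monotonicity (3.3)
            for v in set(range(n)) - s2:
                gain1 = rd(s1 | {v}) - rd(s1)
                gain2 = rd(s2 | {v}) - rd(s2)
                assert gain1 >= gain2 - tol          # submodularity (3.4)
    return True

rng = np.random.default_rng(0)
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W + 0.1 * np.eye(6)    # delta = 0.1 > 0
```

Both checks pass on such regularized Laplacians, consistent with Lemma 2; for a general GP covariance they can fail, per the counterexample of Krause et al. (2008).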
Moreover, we have shown that GRFs serve as a tangible example of the suppressor-free condition.\nWhile V-optimality on GRFs inherits from label propagation (and random walk with absorptions) and has good empirical performance, it is not directly minimizing the 0/1 classification risk. We found that \u03a3-optimality performs even better. The intuition is described in Section 5.1.\nFuture work includes deeper understanding of the direct motivations behind \u03a3-optimality on the GRF classification model and extending the GRF to continuous spaces.\n\nAcknowledgments\n\nThis work is funded in part by NSF grant IIS0911032 and DARPA grant FA87501220324.\n\nReferences\n\nDas, Abhimanyu and Kempe, David. Algorithms for subset selection in linear regression. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 45\u201354. ACM, 2008.\n\nFriedland, S. and Gaubert, S. Submodular spectral functions of principal submatrices of a Hermitian matrix, extensions and applications. Linear Algebra and its Applications, 2011.\n\nGarnett, Roman, Krishnamurthy, Yamuna, Xiong, Xuehan, Schneider, Jeff, and Mann, Richard. Bayesian optimal active search and surveying. In ICML, 2012.\n\nJi, Ming and Han, Jiawei. A variance minimization criterion to active learning on graphs. In AISTATS, 2012.\n\nKrause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research (JMLR), 9:235\u2013284, February 2008.\n\nMartin, Shawn, Brown, W. Michael, Klavans, Richard, and Boyack, Kevin W. OpenOrd: an open-source toolbox for large graph layout. In IS&T/SPIE Electronic Imaging, pp. 
786806\u2013786806. International Society for Optics and Photonics, 2011.\n\nNemhauser, George L., Wolsey, Laurence A., and Fisher, Marshall L. An analysis of approximations for maximizing submodular set functions\u2014I. Mathematical Programming, 14(1):265\u2013294, 1978.\n\nRasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, MA, 2006.\n\nSettles, Burr. Active learning literature survey. University of Wisconsin, Madison, 2010.\n\nWalker, David A. Suppressor variable(s) importance within a regression model: an example of salary compression from career services. Journal of College Student Development, 44(1):127\u2013133, 2003.\n\nWu, Xiao-Ming, Li, Zhenguo, So, Anthony Man-Cho, Wright, John, and Chang, Shih-Fu. Learning with partially absorbing random walks. In Advances in Neural Information Processing Systems 25, pp. 3086\u20133094, 2012.\n\nZhu, Xiaojin and Ghahramani, Zoubin. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.\n\nZhu, Xiaojin, Lafferty, John, and Ghahramani, Zoubin. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pp. 58\u201365, 2003.\n", "award": [], "sourceid": 1272, "authors": [{"given_name": "Yifei", "family_name": "Ma", "institution": "CMU"}, {"given_name": "Roman", "family_name": "Garnett", "institution": "University of Bonn"}, {"given_name": "Jeff", "family_name": "Schneider", "institution": "CMU"}]}