{"title": "Graphical Models for Inference with Missing Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1277, "page_last": 1285, "abstract": "We address the problem of deciding whether there exists a consistent estimator of a given relation Q, when data are missing not at random. We employ a formal representation called `Missingness Graphs' to explicitly portray the causal mechanisms responsible for missingness and to encode dependencies between these mechanisms and the variables being measured. Using this representation, we define the notion of \\textit{recoverability} which ensures that, for a given missingness-graph $G$ and a given query $Q$ an algorithm exists such that in the limit of large samples, it produces an estimate of $Q$ \\textit{as if} no data were missing. We further present conditions that the graph should satisfy in order for recoverability to hold and devise algorithms to detect the presence of these conditions.", "full_text": "Graphical Models for Inference with Missing Data\n\nKarthika Mohan\nDept. of Computer Science\nUniv. of California, Los Angeles\nLos Angeles, CA 90095\nkarthika@cs.ucla.edu\n\nJudea Pearl\nDept. of Computer Science\nUniv. of California, Los Angeles\nLos Angeles, CA 90095\njudea@cs.ucla.edu\n\nJin Tian\nDept. of Computer Science\nIowa State University\nAmes, IA 50011\njtian@iastate.edu\n\nAbstract\n\nWe address the problem of recoverability i.e. deciding whether there exists a con-\nsistent estimator of a given relation Q, when data are missing not at random. We\nemploy a formal representation called \u2018Missingness Graphs\u2019 to explicitly portray\nthe causal mechanisms responsible for missingness and to encode dependencies\nbetween these mechanisms and the variables being measured. 
Using this represen-\ntation, we derive conditions that the graph should satisfy to ensure recoverability\nand devise algorithms to detect the presence of these conditions in the graph.\n\n1\n\nIntroduction\n\nThe \u201cmissing data\u201d problem arises when values for one or more variables are missing from recorded\nobservations. The extent of the problem is evidenced from the vast literature on missing data in such\ndiverse \ufb01elds as social science, epidemiology, statistics, biology and computer science. Missing\ndata could be caused by varied factors such as high cost involved in measuring variables, failure of\nsensors, reluctance of respondents in answering certain questions or an ill-designed questionnaire.\nMissing data also plays a major role in survival data analysis and has been treated primarily using\nKaplan-Meier estimation [30].\nIn machine learning, a typical example is the Recommender System [16] that automatically gen-\nerates a list of products that are of potential interest to a given user from an incomplete dataset of\nuser ratings. Online portals such as Amazon and eBay employ such systems. Other areas such as\ndata mining [7], knowledge discovery [18] and network tomography [2] are also plagued by miss-\ning data problems. Missing data can have several harmful consequences [23, 26]. Firstly they can\nsigni\ufb01cantly bias the outcome of research studies. This is mainly because the response pro\ufb01les of\nnon-respondents and respondents can be signi\ufb01cantly different from each other. Hence ignoring\nthe former distorts the true proportion in the population. Secondly, performing the analysis using\nonly complete cases and ignoring the cases with missing values can reduce the sample size thereby\nsubstantially reducing estimation ef\ufb01ciency. Finally, many of the algorithms and statistical tech-\nniques are generally tailored to draw inferences from complete datasets. 
It may be dif\ufb01cult or even\ninappropriate to apply these algorithms and statistical techniques on incomplete datasets.\n\n1.1 Existing Methods for Handling Missing Data\n\nThere are several methods for handling missing data, described in a rich literature of books, articles\nand software packages, which are brie\ufb02y summarized here1. Of these, listwise deletion and pairwise\ndeletion are used in approximately 96% of studies in the social and behavioral sciences [24].\nListwise deletion refers to a simple method in which cases with missing values are deleted [3]. Un-\nless data are missing completely at random, listwise deletion can bias the outcome [31]. Pairwise\ndeletion (or \u201cavailable case\u201d) is a deletion method used for estimating pairwise relations among vari-\nables. For example, to compute the covariance of variables X and Y , all those cases or observations\nin which both X and Y are observed are used, regardless of whether other variables in the dataset\nhave missing values.\n\n1For detailed discussions we direct the reader to the books [1, 6, 13, 17].\n\nThe expectation-maximization (EM) algorithm is a general technique for \ufb01nding maximum like-\nlihood (ML) estimates from incomplete data. It has been proven that likelihood-based inference\nwhile ignoring the missing data mechanism, leads to unbiased estimates under the assumption of\nmissing at random (MAR) [13]. Most work in machine learning assumes MAR and proceeds with\nML or Bayesian inference. Exceptions are recent works on collaborative \ufb01ltering and recommender\nsystems which develop probabilistic models that explicitly incorporate missing data mechanism\n[16, 14, 15]. ML is often used in conjunction with imputation methods, which in layman terms,\nsubstitutes a reasonable guess for each missing value [1]. A simple example is Mean Substitution, in\nwhich all missing observations of variable X are substituted with the mean of all observed values of\nX. 
Hot-deck imputation, cold-deck imputation [17] and Multiple Imputation [26, 27] are examples\nof popular imputation procedures. Although these techniques work well in practice, performance\nguarantees (eg: convergence and unbiasedness) are based primarily on simulation experiments.\nMissing data discussed so far is a special case of coarse data, namely data that contains observations\nmade in the power set rather than the sample space of variables of interest [12]. The notion of coars-\nening at random (CAR) was introduced in [12] and identi\ufb01es the condition under which coarsening\nmechanism can be ignored while drawing inferences on the distribution of variables of interest [10].\nThe notion of sequential CAR has been discussed in [9]. For a detailed discussion on coarsened data\nrefer to [30].\nMissing data literature leaves many unanswered questions with regard to theoretical guarantees for\nthe resulting estimates, the nature of the assumptions that must be made prior to employing various\nprocedures and whether the assumptions are testable. For a gentle introduction to the missing data\nproblem and the issue of testability refer to [22, 19]. This paper aims to illuminate missing data\nproblems using causal graphs [See Appendix 5.2 for justi\ufb01cation]. The questions we pose are:\nGiven a target relation Q to be estimated and a set of assumptions about the missingness process\nencoded in a graphical model, under what conditions does a consistent estimate exist and how can\nwe elicit it from the data available?\nWe answer these questions with the aid of Missingness Graphs (m-graphs in short) to be described\nin Section 2. Furthermore, we review the traditional taxonomy of missing data problems and cast it\nin graphical terms. In Section 3 we de\ufb01ne the notion of recoverability - the existence of a consistent\nestimate - and present graphical conditions for detecting recoverability of a given probabilistic query\nQ. 
Conclusions are drawn in Section 4.\n\n2 Graphical Representation of the Missingness Process\n\n2.1 Missingness Graphs\n\nFigure 1: m-graphs for data that are: (a) MCAR, (b) MAR, (c) & (d) MNAR; Hollow and solid\ncircles denote partially and fully observed variables respectively.\n\nGraphical models such as DAGs (Directed Acyclic Graphs) can be used for encoding as well as\nportraying conditional independencies and causal relations, and the graphical criterion called d-\nseparation (refer Appendix-5.1, De\ufb01nition-3) can be used to read them off the graph [21, 20]. Graph-\nical Models have been used to analyze missing information in the form of missing cases (due to\nsample selection bias)[4]. Using causal graphs, [8] analyzes missingness due to attrition (partially\nobserved outcome) and [29] cautions against the indiscriminate use of auxiliary variables. In both\npapers missing values are associated with one variable and interactions among several missingness\nmechanisms remain unexplored.\nThe need exists for a general approach capable of modeling an arbitrary data-generating process and\ndeciding whether (and how) missingness can be outmaneuvered in every dataset generated by that\nprocess. Such a general approach should allow each variable to be governed by its own missingness\nmechanism, and each mechanism to be triggered by other (potentially) partially observed variables\nin the model. To achieve this \ufb02exibility we use a graphical model called \u201cmissingness graph\u201d (m-\ngraph, for short) which is a DAG (Directed Acyclic Graph) de\ufb01ned as follows.\nLet G($\\mathbb{V}$, E) be the causal DAG where $\\mathbb{V}$ = V \u222a U \u222a V \u2217 \u222a R. V is the set of observable nodes.\nNodes in the graph correspond to variables in the data set. U is the set of unobserved nodes (also\ncalled latent variables). E is the set of edges in the DAG. 
Oftentimes we use bi-directed edges as\na shorthand notation to denote the existence of a U variable as common parent of two variables in\nVo \u222a Vm \u222a R. V is partitioned into Vo and Vm such that Vo \u2286 V is the set of variables that are\nobserved in all records in the population and Vm \u2286 V is the set of variables that are missing in\nat least one record. Variable X is termed as fully observed if X \u2208 Vo and partially observed if\nX \u2208 Vm.\nAssociated with every partially observed variable Vi \u2208 Vm are two other variables Rvi and $V_i^*$,\nwhere $V_i^*$ is a proxy variable that is actually observed, and Rvi represents the status of the causal\nmechanism responsible for the missingness of $V_i^*$; formally,\n\n$v_i^* = f(r_{v_i}, v_i) = \\begin{cases} v_i & \\text{if } r_{v_i} = 0 \\\\ m & \\text{if } r_{v_i} = 1 \\end{cases}$   (1)\n\nContrary to conventional use, Rvi is not treated merely as the missingness indicator but as a driver\n(or a switch) that enforces equality between Vi and $V_i^*$. V \u2217 is a set of all proxy variables and\nR is the set of all causal mechanisms that are responsible for missingness. R variables may not\nbe parents of variables in V \u222a U. This graphical representation succinctly depicts both the causal\nrelationships among variables in V and the process that accounts for missingness in some of the\nvariables. We call this graphical representation Missingness Graph or m-graph for short. 
Since\nevery d-separation in the graph implies conditional independence in the distribution [21], the m-\ngraph provides an effective way of representing the statistical properties of the missingness process\nand, hence, the potential of recovering the statistics of variables in Vm from partially missing data.\n\n2.2 Taxonomy of Missingness Mechanisms\n\nIt is common to classify missing data mechanisms into three types [25, 13]:\nMissing Completely At Random (MCAR) : Data are MCAR if the probability that Vm is missing\nis independent of Vm or any other variable in the study, as would be the case when respondents\ndecide to reveal their income levels based on coin-\ufb02ips.\nMissing At Random (MAR) : Data are MAR if for all data cases Y , P (R|Yobs, Ymis) = P (R|Yobs)\nwhere Yobs denotes the observed component of Y and Ymis, the missing component. Example:\nWomen in the population are more likely to not reveal their age.\nMissing Not At Random (MNAR) or \u201cnon-ignorable missing\u201d: Data that are neither MAR nor\nMCAR are termed as MNAR. Example: Online shoppers rate an item with a high probability either\nif they love the item or if they loathe it. In other words, the probability that a shopper supplies a\nrating is dependent on the shopper\u2019s underlying liking [16].\nBecause it invokes speci\ufb01c values of the observed and unobserved variables, (i.e., Yobs and Ymis),\nmany authors \ufb01nd Rubin\u2019s de\ufb01nition dif\ufb01cult to apply in practice and prefer to work with de\ufb01nitions\nexpressed in terms of independencies among variables (see [28, 11, 6, 17]). In the graph-based\ninterpretation used in this paper, MCAR is de\ufb01ned as total independence between R and Vo\u222aVm\u222aU\ni.e. R\u22a5\u22a5(Vo \u222a Vm \u222a U ), as depicted in Figure 1(a). MAR is de\ufb01ned as independence between R and\nVm\u222aU given Vo i.e. R\u22a5\u22a5Vm\u222aU|Vo, as depicted in Figure 1(b). 
Finally if neither of these conditions\nholds, data are termed MNAR, as depicted in Figure 1(c) and (d). This graph-based interpretation uses\nslightly stronger assumptions than Rubin\u2019s, with the advantage that the user can comprehend, encode\nand communicate the assumptions that determine the classi\ufb01cation of the problem. Additionally, the\nconditional independencies that de\ufb01ne each class are represented explicitly as separation conditions\nin the corresponding m-graphs. We will use this taxonomy in the rest of the paper, and will label\ndata MCAR, MAR and MNAR according to whether the de\ufb01ning conditions, R\u22a5\u22a5Vo \u222a Vm \u222a U (for\nMCAR), R\u22a5\u22a5Vm \u222a U|Vo (for MAR) are satis\ufb01ed in the corresponding m-graphs.\n\n3 Recoverability\n\nIn this section we will examine the conditions under which a bias-free estimate of a given proba-\nbilistic relation Q can be computed. We shall begin by de\ufb01ning the notion of recoverability.\nDe\ufb01nition 1 (Recoverability). Given a m-graph G, and a target relation Q de\ufb01ned on the variables\nin V , Q is said to be recoverable in G if there exists an algorithm that produces a consistent estimate\nof Q for every dataset D such that P (D) is (1) compatible with G and (2) strictly positive over\ncomplete cases i.e. P (Vo, Vm, R = 0) > 0 (see footnote 2).\nHere we assume that the observed distribution over complete cases P (Vo, Vm, R = 0) is strictly\npositive, thereby rendering recoverability a property that can be ascertained exclusively from the\nm-graph.\nCorollary 1. A relation Q is recoverable in G if and only if Q can be expressed in terms of the\nprobability P (O) where O = {R, V \u2217, Vo} is the set of observable variables in G. 
In other words,\nfor any two models M1 and M2 inducing distributions P M1 and P M2 respectively, if P M1(O) =\nP M2(O) > 0 then QM1 = QM2.\n\nProof: (sketch) The corollary merely rephrases the requirement of obtaining a consistent estimate to\nthat of expressibility in terms of observables.\nPractically, what recoverability means is that if the data D are generated by any process compatible\nwith G, a procedure exists that computes an estimator \u02c6Q(D) such that, in the limit of large samples,\n\u02c6Q(D) converges to Q. Such a procedure is called a \u201cconsistent estimator.\u201d Thus, recoverability is\nthe sole property of G and Q, not of the data available, or of any routine chosen to analyze or process\nthe data.\nRecoverability when data are MCAR For MCAR data we have R\u22a5\u22a5(Vo \u222a Vm). Therefore, we\ncan write P (V ) = P (V |R) = P (Vo, V \u2217|R = 0). Since both R and V \u2217 are observables, the joint\nprobability P (V ) is consistently estimable (hence recoverable) by considering complete cases only\n(listwise deletion), as shown in the following example.\nExample 1. Let X be the treatment and Y be the outcome as depicted in the m-graph in Fig. 1\n(a). Let it be the case that we accidentally deleted the values of Y for a handful of samples, hence\nY \u2208 Vm. Can we recover P (X, Y )?\nFrom D, we can compute P (X, Y \u2217, Ry). From the m-graph G, we know that Y \u2217 is a collider and\nhence by d-separation, (X \u222a Y )\u22a5\u22a5Ry. Thus P (X, Y ) = P (X, Y |Ry). In particular, P (X, Y ) =\nP (X, Y |Ry = 0). When Ry = 0, by eq. (1), Y \u2217 = Y . Hence,\n\nP (X, Y ) = P (X, Y \u2217|Ry = 0)\n\n(2)\n\nThe RHS of Eq. 2 is consistently estimable from D; hence P (X, Y ) is recoverable.\nRecoverability when data are MAR When data are MAR, we have R\u22a5\u22a5Vm|Vo. Therefore\nP (V ) = P (Vm|Vo)P (Vo) = P (Vm|Vo, R = 0)P (Vo). Hence the joint distribution P (V ) is re-\ncoverable.\nExample 2. 
Let X be the treatment and Y be the outcome as depicted in the m-graph in Fig. 1 (b).\nLet it be the case that some patients who underwent treatment are not likely to report the outcome,\nhence the arrow X \u2192 Ry. Under the circumstances, can we recover P (X, Y )?\nFrom D, we can compute P (X, Y \u2217, Ry). From the m-graph G, we see that Y \u2217 is a collider and X is\na fork. Hence by d-separation, Y \u22a5\u22a5Ry|X. Thus P (X, Y ) = P (Y |X)P (X) = P (Y |X, Ry)P (X).\nIn particular, P (X, Y ) = P (Y |X, Ry = 0)P (X). When Ry = 0, by eq. (1), Y \u2217 = Y . Hence,\n\nP (X, Y ) = P (Y \u2217|X, Ry = 0)P (X)\n\n(3)\n\nand since X is fully observable, P (X, Y ) is recoverable.\n\n2In many applications such as truncation by death, the problem forbids certain combinations of events\nfrom occurring, in which case the de\ufb01nition needs to be modi\ufb01ed to accommodate such constraints as shown in\nAppendix-5.3. Though this modi\ufb01cation complicates the de\ufb01nition of \u201crecoverability\u201d, it does not change the\nbasic results derived in this paper.\n\nNote that eq. (2) permits P (X, Y ) to be recovered by listwise deletion, while eq. (3) does not; it\nrequires that P (X) be estimated \ufb01rst over all samples, including those in which Y is missing. In\nthis paper we focus on recoverability under large sample assumption and will not be dealing with\nthe shrinking sample size issue.\n\nRecoverability when data are MNAR Data that are neither MAR nor MCAR are termed MNAR.\nThough it is generally believed that relations in MNAR datasets are not recoverable, the following\nexample demonstrates otherwise.\nExample 3. Fig. 1 (d) depicts a study where (i) some units who underwent treatment (X = 1) did\nnot report the outcome (Y ) and (ii) we accidentally deleted the values of treatment for a handful\nof cases. Thus we have missing values for both X and Y which renders the dataset MNAR. We\nshall show that P (X, Y ) is recoverable. 
From D, we can compute P (X\u2217, Y \u2217, Rx, Ry). From the\nm-graph G, we see that X\u22a5\u22a5Rx and Y \u22a5\u22a5(Rx \u222a Ry)|X. Thus P (X, Y ) = P (Y |X)P (X) =\nP (Y |X, Ry = 0, Rx = 0)P (X|Rx = 0). When Ry = 0 and Rx = 0 we have (by Equation (1) ),\nY \u2217 = Y and X\u2217 = X. Hence,\n\nP (X, Y ) = P (Y \u2217|X\u2217, Rx = 0, Ry = 0)P (X\u2217|Rx = 0)\n\n(4)\n\nTherefore, P (X, Y ) is recoverable.\n\nThe estimand in eq. (4) also dictates how P (X, Y ) should be estimated from the dataset. In the \ufb01rst\nstep, we delete all cases in which X is missing and create a new data set D\u2032 from which we estimate\nP (X). Dataset D\u2032 is further pruned to form dataset D\u2033 by removing all cases in which Y is missing.\nP (Y |X) is then computed from D\u2033. Note that order matters; had we deleted cases in the reverse\norder, Y and then X, the resulting estimate would be biased because the d-separations needed for\nestablishing the validity of the estimand: P (X|Y )P (Y ), are not supported by G. We will call this\nsequence of deletions the deletion order.\nSeveral features are worth noting regarding this graph-based taxonomy of missingness mechanisms.\nFirst, although MCAR and MAR can be veri\ufb01ed by inspecting the m-graph, they cannot, in general\nbe veri\ufb01ed from the data alone. Second, the assumption of MCAR allows an estimation procedure\nthat amounts (asymptotically) to listwise deletion, while MAR dictates a procedure that amounts\nto listwise deletion in every stratum of Vo. Applying the MAR procedure to an MCAR problem is safe,\nbecause all conditional independencies required for recoverability under the MAR assumption also\nhold in an MCAR problem, i.e. R\u22a5\u22a5(Vo, Vm) \u21d2 R\u22a5\u22a5Vm|Vo. The converse, however, does not\nhold, as can be seen in Fig. 1 (b). Applying listwise deletion is likely to result in bias, because the\nnecessary condition R\u22a5\u22a5(Vo, Vm) is violated in the graph. 
An interesting property which evolves\nfrom this discussion is that recoverability of certain relations does not require RVi\u22a5\u22a5Vi|Vo ; a subset\nof Vo would suf\ufb01ce as shown below.\nProperty 1. P (Vi) is recoverable if \u2203W \u2286 Vo such that RVi\u22a5\u22a5Vi|W .\n\nProof: P (Vi) may be decomposed as: $P(V_i) = \\sum_w P(V_i^* | R_{v_i} = 0, W) P(W)$ since Vi\u22a5\u22a5RVi|W\nand W \u2286 Vo. Hence P (Vi) is recoverable.\nIt is important to note that the recoverability of P (X, Y ) in Fig. 1(d) was feasible despite the fact\nthat the missingness model would not be considered Rubin\u2019s MAR (as de\ufb01ned in [25]). In fact, an\noverwhelming majority of the data generated by each one of our MNAR examples would be outside\nRubin\u2019s MAR. For a brief discussion on these lines, refer to Appendix- 5.4.\nOur next question is: how can we determine if a given relation is recoverable? The following\ntheorem provides a suf\ufb01cient condition for recoverability.\n\n3.1 Conditions for Recoverability\nTheorem 1. A query Q de\ufb01ned over variables in Vo \u222a Vm is recoverable if it is decomposable into\nterms of the form Qj = P (Sj|Tj) such that Tj contains the missingness mechanism Rv = 0 of\nevery partially observed variable V that appears in Qj.\n\nProof: If such a decomposition exists, every Qj is estimable from the data, hence the entire expres-\nsion for Q is recoverable.\nExample 4. Equation (4) demonstrates a decomposition of Q = P (X, Y ) into a product of two\nterms Q1 = P (Y |X, Rx = 0, Ry = 0) and Q2 = P (X|Rx = 0) that satisfy the condition of\nTheorem 1. Hence Q is recoverable.\nExample 5. Consider the problem of recovering Q = P (X, Y ) from the m-graph of Fig. 3(b).\nAttempts to decompose Q by the chain rule, as was done in Eqs. (3) and (4) would not satisfy the\nconditions of Theorem 1. 
To witness we write P (X, Y ) = P (Y |X)P (X) and note that the graph\ndoes not permit us to augment any of the two terms with the necessary Rx or Ry terms; X is\nindependent of Rx only if we condition on Y , which is partially observed, and Y is independent of\nRy only if we condition on X which is also partially observed. This deadlock can be disentangled\nhowever using a non-conventional decomposition:\n\n$Q = P(X, Y) = P(X, Y) \\frac{P(R_x, R_y | X, Y)}{P(R_x, R_y | X, Y)} = \\frac{P(R_x, R_y) P(X, Y | R_x, R_y)}{P(R_x | Y, R_y) P(R_y | X, R_x)}$   (5)\n\nwhere the denominator was obtained using the independencies Rx\u22a5\u22a5(X, Ry)|Y and\nRy\u22a5\u22a5(Y, Rx)|X shown in the graph. The \ufb01nal expression above satis\ufb01es Theorem 1 and\nrenders P (X, Y ) recoverable. This example again shows that recovery is feasible even when data\nare MNAR.\nTheorem 2 operationalizes the decomposability requirement of Theorem 1.\nTheorem 2 (Recoverability of the Joint P (V )). Given a m-graph G with no edges between the R\nvariables and no latent variables as parents of R variables, a necessary and suf\ufb01cient condition\nfor recovering the joint distribution P (V ) is that no variable X be a parent of its missingness\nmechanism RX. Moreover, when recoverable, P (V ) is given by\n\n$P(v) = \\frac{P(R = 0, v)}{\\prod_i P(R_i = 0 | pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}} = 0)}$,   (6)\n\nwhere $Pa^o_{r_i} \\subseteq V_o$ and $Pa^m_{r_i} \\subseteq V_m$ are the parents of Ri.\n\nProof. (suf\ufb01ciency) The observed joint distribution may be decomposed according to G as\n\n$P(R = 0, v) = \\sum_u P(v, u) P(R = 0 | v, u) = P(v) \\prod_i P(R_i = 0 | pa^o_{r_i}, pa^m_{r_i})$,   (7)\n\nwhere we have used the facts that there are no edges between the R variables, and that there are no\nlatent variables as parents of R variables. If Vi is not a parent of Ri (i.e. $V_i \\notin Pa^m_{r_i}$), then we have\n$R_i \\perp\\!\\!\\!\\perp R_{Pa^m_{r_i}} | (Pa^o_{r_i} \\cup Pa^m_{r_i})$. Therefore,\n\n$P(R_i = 0 | pa^o_{r_i}, pa^m_{r_i}) = P(R_i = 0 | pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}} = 0)$.   (8)\n\nGiven strictly positive P (R = 0, Vm, Vo), we have that all probabilities $P(R_i = 0 | pa^o_{r_i}, pa^m_{r_i}, R_{Pa^m_{r_i}} = 0)$\nare strictly positive. Using Equations (7) and (8) , we conclude that\nP (V ) is recoverable as given by Eq. (6).\n(necessity) If X is a parent of its missingness mechanism RX, then P (X) is not recoverable based\non Lemmas 3 and 4 in Appendix 5.5. Therefore the joint P (V ) is not recoverable.\n\nThe following theorem gives a suf\ufb01cient condition for recovering the joint distribution in a Marko-\nvian model.\nTheorem 3. Given a m-graph with no latent variables (i.e., Markovian) the joint distribution P (V )\nis recoverable if no missingness mechanism RX is a descendant of its corresponding variable X.\nMoreover, if recoverable, then P (V ) is given by\n\n$P(v) = \\prod_{i, V_i \\in V_o} P(v_i | pa^o_i, pa^m_i, R_{Pa^m_i} = 0) \\prod_{j, V_j \\in V_m} P(v_j | pa^o_j, pa^m_j, R_{V_j} = 0, R_{Pa^m_j} = 0)$,   (9)\n\nwhere $Pa^o_i \\subseteq V_o$ and $Pa^m_i \\subseteq V_m$ are the parents of Vi.\n\nProof: Refer Appendix-5.6\nDe\ufb01nition 2 (Ordered factorization). An ordered factorization over a set O of ordered V vari-\nables Y1 < Y2 < . . . < Yk, denoted by f (O), is a product of conditional probabilities $f(O) = \\prod_i P(Y_i | X_i)$\nwhere Xi \u2286 {Yi+1, . . . , Yn} is a minimal set such that Yi\u22a5\u22a5({Yi+1, . . . , Yn}\\Xi)|Xi.\nTheorem 4. A suf\ufb01cient condition for recoverability of a relation Q is that Q be decomposable into\nan ordered factorization, or a sum of such factorizations, such that every factor Qi = P (Yi|Xi)\nsatis\ufb01es Yi\u22a5\u22a5(Ryi, Rxi )|Xi. 
A factorization that satis\ufb01es this condition will be called admissible.\n\nFigure 2: Graph in which (a) only P (X|Y ) is recoverable (b) P (X4) is recoverable only when\nconditioned on X1 as shown in Example 6 (c) P (X, Y, Z) is recoverable (d) P (X, Z) is recoverable.\n\nProof. follows from Theorem-1 noting that ordered factorization is one speci\ufb01c form of decompo-\nsition.\n\nTheorem 4 will allow us to con\ufb01rm recoverability of certain queries Q in models such as those in\nFig. 2(a), (b) and (d), which do not satisfy the requirement in Theorem 2. For example, by applying\nTheorem 4 we can conclude that, (1) in Figure 2 (a), P (X|Y ) = P (X|Rx = 0, Ry = 0, Y ) is\nrecoverable, (2) in Figure 2 (c), P (X, Y, Z) = P (Z|X, Y, Rz = 0, Rx = 0, Ry = 0)P (X|Y, Rx =\n0, Ry = 0)P (Y |Ry = 0) is recoverable and (3) in Figure 2 (d), P (X, Z) = P (X, Z|Rx = 0, Rz =\n0) is recoverable.\nNote that the condition of Theorem 4 differs from that of Theorem 1 in two ways. Firstly, the\ndecomposition is limited to ordered factorizations i.e. Yi is a singleton and Xi a set. Secondly, both\nYi and Xi are taken from Vo \u222a Vm, thus excluding R variables.\nExample 6. Consider the query Q = P (X4) in Fig. 2(b). Q can be decomposed in a variety of\nways, among them being the factorizations:\n\n(a) $P(X_4) = \\sum_{x_3} P(X_4 | X_3) P(X_3)$ for the order X4, X3\n(b) $P(X_4) = \\sum_{x_2} P(X_4 | X_2) P(X_2)$ for the order X4, X2\n(c) $P(X_4) = \\sum_{x_1} P(X_4 | X_1) P(X_1)$ for the order X4, X1\n\nAlthough each of X1, X2 and X3 d-separate X4 from RX4, only (c) is admissible since each factor\nsatis\ufb01es Theorem 4. Speci\ufb01cally, (c) can be written as $\\sum_{x_1} P(X_4^* | X_1, R_{X_4} = 0) P(X_1)$ and can\nbe estimated by the deletion schedule (X1, X4), i.e., in each stratum of X1, we delete samples for\nwhich RX4 = 1 and compute $P(X_4^*, R_{x_4} = 0, X_1)$. In (a) and (b) however, Theorem-4 is not\nsatis\ufb01ed since the graph does not permit us to rewrite P (X3) as P (X3|Rx3 = 0) or P (X2) as\nP (X2|Rx2 = 0).\n\n3.2 Heuristics for Finding Admissible Factorization\n\nConsider the task of estimating Q = P (X), where X is a set, by searching for an admissible\nfactorization of P (X) (one that satis\ufb01es Theorem 4), possibly by resorting to additional variables,\nZ, residing outside of X that serve as separating sets. Since there is an exponentially large number\nof ordered factorizations, it would be helpful to rule out classes of non-admissible ordering prior\nto their enumeration whenever non-admissibility can be detected in the graph. In this section, we\nprovide lemmata that would aid in the pruning process by harnessing information from the graph.\nLemma 1. An ordered set O will not yield an admissible decomposition if there exists a partially\nobserved variable Vi in the order O which is not marginally independent of RVi such that all minimal\nseparators (refer Appendix-5.1, De\ufb01nition-4) of Vi that d-separate it from Rvi appear before Vi.\n\nProof: Refer Appendix-5.7\n\nFigure 3: demonstrates (a) pruning in Example-7 (b) P (X, Y ) is recoverable in Example-5\n\nApplying lemma-1 requires a solution to a set of disjunctive constraints which can be represented\nby directed constraint graphs [5].\nExample 7. Let Q = P (X) be the relation to be recovered from the graph in Fig. 3 (a). Let\nX = {A, B, C, D, E} and Z = F . The total number of ordered factorizations is 6! = 720.\nThe independencies implied by minimal separators (as required by Lemma-1) are: A\u22a5\u22a5RA|B,\nB\u22a5\u22a5RB|\u03c6, C\u22a5\u22a5RC|{D, E}, ( D\u22a5\u22a5RD|A or D\u22a5\u22a5RD|C or D\u22a5\u22a5RD|B ) and (E\u22a5\u22a5RE|{B, F} or\nE\u22a5\u22a5RE|{B, D} or E\u22a5\u22a5RE|C). 
To test whether (B,A,D,E,C,F) is potentially admissible we need\nnot explicate all 6 variables; this order can be ruled out as soon as we note that A appears after B.\nSince B is the only minimal separator that d-separates A from RA and B precedes A, Lemma-1 is\nviolated. Orders such as (C, D, E, A, B, F ), (C, D, A, E, B, F ) and (C, E, D, A, F, B) satisfy the\ncondition stated in Lemma 1 and are potential candidates for admissibility.\n\nThe following lemma presents a simple test to determine non-admissibility by specifying the condi-\ntion under which a given order can be summarily removed from the set of candidate orders that are\nlikely to yield admissible factorizations.\nLemma 2. An ordered set O will not yield an admissible decomposition if it contains a partially\nobserved variable Vi for which there exists no set S \u2286 V that d-separates Vi from RVi.\nProof: The factor P (Vi|Vi+1, . . . , Vn) corresponding to Vi can never satisfy the condition required\nby Theorem 4.\nAn interesting consequence of Lemma 2 is the following corollary that gives a suf\ufb01cient condition\nunder which no ordered factorization can be labeled admissible.\nCorollary 2. For any disjoint sets X and Y , there exists no admissible factorization for recovering\nthe relation P (Y |X) by Theorem 4 if Y contains a partially observed variable Vi for which there\nexists no set S \u2286 V that d-separates Vi from RVi.\n\n4 Conclusions\n\nWe have demonstrated that causal graphical models depicting the data generating process can serve\nas a powerful tool for analyzing missing data problems and determining (1) if theoretical imped-\niments exist to eliminating bias due to data missingness, (2) whether a given procedure produces\nconsistent estimates, and (3) whether such a procedure can be found algorithmically. 
We formalized\nthe notion of recoverability and showed that relations are always recoverable when data are missing\nat random (MCAR or MAR) and, more importantly, that in many commonly occurring problems,\nrecoverability can be achieved even when data are missing not at random (MNAR). We further\npresented a suf\ufb01cient condition to ensure recoverability of a given relation Q (Theorem 1) and oper-\nationalized Theorem 1 using graphical criteria (Theorems 2, 3 and 4). In summary, we demonstrated\nsome of the insights and capabilities that can be gained by exploiting causal knowledge in missing\ndata problems.\nAcknowledgment\nThis research was supported in parts by grants from NSF #IIS-1249822 and #IIS-1302448 and ONR\n#N00014-13-1-0153 and #N00014-10-1-0933\n\nReferences\n[1] P.D. Allison. Missing data series: Quantitative applications in the social sciences, 2002.\n\n[2] T. Bu, N. Duf\ufb01eld, F.L. Presti, and D. Towsley. Network tomography on general topologies. In ACM\n\nSIGMETRICS Performance Evaluation Review, volume 30, pages 21\u201330. ACM, 2002.\n\n[3] E.R. Buhi, P. Goodson, and T.B. Neilands. Out of sight, not out of mind: strategies for handling missing\n\ndata. American journal of health behavior, 32:83\u201392, 2008.\n\n[4] R.M. Daniel, M.G. Kenward, S.N. Cousens, and B.L. De Stavola. Using causal diagrams to guide analysis\n\nin missing data problems. Statistical Methods in Medical Research, 21(3):243\u2013256, 2012.\n\n[5] R. Dechter, I. Meiri, and J. Pearl. Temporal constraint networks. Arti\ufb01cial intelligence, 1991.\n[6] C.K. Enders. Applied Missing Data Analysis. Guilford Press, 2010.\n[7] U.M. Fayyad. Data mining and knowledge discovery: Making sense out of data. IEEE expert, 11(5):20\u2013\n\n25, 1996.\n\n[8] F. M. Garcia. De\ufb01nition and diagnosis of problematic attrition in randomized controlled experiments.\n\nWorking paper, April 2013. 
Available at SSRN: http://ssrn.com/abstract=2267120.\n[9] R.D. Gill and J.M. Robins. Sequential models for coarsening and missingness. In Proceedings of the First Seattle Symposium in Biostatistics, pages 295\u2013305. Springer, 1997.\n[10] R.D. Gill, M.J. Van Der Laan, and J.M. Robins. Coarsening at random: Characterizations, conjectures, counter-examples. In Proceedings of the First Seattle Symposium in Biostatistics, pages 255\u2013294. Springer, 1997.\n[11] J.W. Graham. Missing Data: Analysis and Design (Statistics for Social and Behavioral Sciences). Springer, 2012.\n[12] D.F. Heitjan and D.B. Rubin. Ignorability and coarse data. The Annals of Statistics, pages 2244\u20132253, 1991.\n[13] R.J.A. Little and D.B. Rubin. Statistical analysis with missing data. Wiley, 2002.\n[14] B.M. Marlin and R.S. Zemel. Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM conference on Recommender systems, pages 5\u201312. ACM, 2009.\n[15] B.M. Marlin, R.S. Zemel, S. Roweis, and M. Slaney. Collaborative filtering and the missing at random assumption. In UAI, 2007.\n[16] B.M. Marlin, R.S. Zemel, S.T. Roweis, and M. Slaney. Recommender systems: missing data and statistical model estimation. In IJCAI, 2011.\n[17] P.E. McKnight, K.M. McKnight, S. Sidani, and A.J. Figueredo. Missing data: A gentle introduction. Guilford Press, 2007.\n[18] Harvey J Miller and Jiawei Han. Geographic data mining and knowledge discovery. CRC, 2009.\n[19] K. Mohan and J. Pearl. On the testability of models with missing data. To appear in the Proceedings of AISTATS-2014; Available at http://ftp.cs.ucla.edu/pub/stat_ser/r415.pdf.\n[20] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.\n[21] J. Pearl. Causality: models, reasoning and inference. Cambridge Univ Press, New York, 2009.\n[22] J. Pearl and K. Mohan. 
Recoverability and testability of missing data: Introduction and summary of results. Technical Report R-417, UCLA, 2013. Available at http://ftp.cs.ucla.edu/pub/stat_ser/r417.pdf.\n[23] C.Y.J. Peng, M. Harwell, S.M. Liou, and L.H. Ehman. Advances in missing data methods and implications for educational research. Real data analysis, pages 31\u201378, 2006.\n[24] J.L. Peugh and C.K. Enders. Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of educational research, 74(4):525\u2013556, 2004.\n[25] D.B. Rubin. Inference and missing data. Biometrika, 63:581\u2013592, 1976.\n[26] D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. Wiley Online Library, New York, NY, 1987.\n[27] D.B. Rubin. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473\u2013489, 1996.\n[28] J.L. Schafer and J.W. Graham. Missing data: our view of the state of the art. Psychological Methods, 7(2):147\u2013177, 2002.\n[29] F. Thoemmes and N. Rose. Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal. Technical Report R-002, Cornell University, 2013.\n[30] M.J. Van der Laan and J.M. Robins. Unified methods for censored longitudinal data and causality. Springer Verlag, 2003.\n[31] W. Wothke. Longitudinal and multigroup modeling with missing data. Lawrence Erlbaum Associates Publishers, 2000.", "award": [], "sourceid": 659, "authors": [{"given_name": "Karthika", "family_name": "Mohan", "institution": "UCLA"}, {"given_name": "Judea", "family_name": "Pearl", "institution": "UCLA"}, {"given_name": "Jin", "family_name": "Tian", "institution": "Iowa State University"}]}