{"title": "Multi-view Anomaly Detection via Robust Probabilistic Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1136, "page_last": 1144, "abstract": "We propose probabilistic latent variable models for multi-view anomaly detection, which is the task of finding instances that have inconsistent views given multi-view data. With the proposed model, all views of a non-anomalous instance are assumed to be generated from a single latent vector. On the other hand, an anomalous instance is assumed to have multiple latent vectors, and its different views are generated from different latent vectors. By inferring the number of latent vectors used for each instance with Dirichlet process priors, we obtain multi-view anomaly scores. The proposed model can be seen as a robust extension of probabilistic canonical correlation analysis for noisy multi-view data. We present Bayesian inference procedures for the proposed model based on a stochastic EM algorithm. The effectiveness of the proposed model is demonstrated in terms of performance when detecting multi-view anomalies.", "full_text": "Multi-view Anomaly Detection via Robust\n\nProbabilistic Latent Variable Models\n\nTomoharu Iwata\n\nNTT Communication Science Laboratories\n\nMakoto Yamada\nKyoto University\n\niwata.tomoharu@lab.ntt.co.jp\n\nmakoto.m.yamada@ieee.org\n\nAbstract\n\nWe propose probabilistic latent variable models for multi-view anomaly detec-\ntion, which is the task of \ufb01nding instances that have inconsistent views given\nmulti-view data. With the proposed model, all views of a non-anomalous instance\nare assumed to be generated from a single latent vector. On the other hand, an\nanomalous instance is assumed to have multiple latent vectors, and its different\nviews are generated from different latent vectors. By inferring the number of la-\ntent vectors used for each instance with Dirichlet process priors, we obtain multi-\nview anomaly scores. 
The proposed model can be seen as a robust extension of probabilistic canonical correlation analysis for noisy multi-view data. We present Bayesian inference procedures for the proposed model based on a stochastic EM algorithm. The effectiveness of the proposed model is demonstrated in terms of performance when detecting multi-view anomalies.\n\n1 Introduction\n\nThere has been great interest in multi-view learning, in which data are obtained from various information sources. In a wide variety of applications, data are naturally comprised of multiple views. For example, an image can be represented by color, texture and shape information; a web page can be represented by words, images and URLs occurring in the page; and a video can be represented by audio and visual features. In this paper, we consider the task of finding anomalies in multi-view data. The task is called horizontal anomaly detection [13], or multi-view anomaly detection [16]. Anomalies in multi-view data are instances that have inconsistent features across multiple views.\n\nMulti-view anomaly detection can be used for many applications, such as information disparity management [9], purchase behavior analysis [13], malicious insider detection [16], and user aggregation from multiple databases. In information disparity management, multiple views can be obtained from documents written in different languages, such as Wikipedia. Multi-view anomaly detection tries to find documents that contain different information across different languages, which would be helpful for editors to select documents to be updated, or beneficial for cultural anthropologists to analyze social differences across different languages. In purchase behavior analysis, multiple views for each item can be defined as its genre and its purchase history, i.e. 
a set of users who purchased the item. Multi-view anomaly detection can find movies inconsistently purchased by users based on the movie genre, which would assist in creating marketing strategies.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: (a) A multi-view anomaly \u2018M\u2019 and a single-view anomaly \u2018S\u2019 in a two-view data set. Each letter represents an instance, and the same letter indicates the same instance. W_d is a projection matrix for view d. (b) Graphical model representation of the proposed model.\n\nMulti-view anomaly detection is different from standard (single-view) anomaly detection. Single-view anomaly detection finds instances that do not conform to expected behavior [6]. Figure 1 (a) shows the difference between a multi-view anomaly and a single-view anomaly in a two-view data set. \u2018M\u2019 is a multi-view anomaly since \u2018M\u2019 belongs to different clusters in different views (the \u2018A\u2013D\u2019 cluster in View 1 and the \u2018E\u2013J\u2019 cluster in View 2) and the views of \u2018M\u2019 are not consistent. \u2018S\u2019 is a single-view anomaly since \u2018S\u2019 is located far from the other instances in each view. However, both views of \u2018S\u2019 have the same relationship with the others (they are far from the other instances), and thus \u2018S\u2019 is not a multi-view anomaly. Single-view anomaly detection methods, such as one-class support vector machines [18] or tensor-based anomaly detection [11], consider that \u2018S\u2019 is anomalous. 
On the other hand, we would like to develop a multi-view anomaly detection method that detects \u2018M\u2019 as an anomaly, but not \u2018S\u2019. Note that although single-view anomalies are uncommon instances, multi-view anomalies can be the majority if they are inconsistent across multiple views.\n\nWe propose a probabilistic latent variable model for multi-view anomaly detection. With the proposed model, there is a latent space that is shared across all views. We assume that all views of a non-anomalous (normal) instance are generated using a single latent vector. On the other hand, an anomalous instance is assumed to have multiple latent vectors, and its different views are generated using different latent vectors, which indicates inconsistency across different views of the instance. Figure 1 (a) shows an example of a latent space shared by the two-view data. Two views of every non multi-view anomaly can be generated from a single latent vector using view-dependent projection matrices. On the other hand, since the two views of multi-view anomaly \u2018M\u2019 are not consistent, two latent vectors are required to generate the two views using the projection matrices.\n\nSince the number of latent vectors for each instance is unknown, we automatically infer it from the given data by using Dirichlet process priors. The inference of the proposed model is based on a stochastic EM algorithm. In the E-step, a latent vector is assigned for each view of each instance using collapsed Gibbs sampling while analytically integrating out latent vectors. In the M-step, projection matrices for mapping latent vectors into observations are estimated by maximizing the joint likelihood. 
By alternately iterating the E- and M-steps, we infer the number of latent vectors used in each instance and calculate its anomaly score from the probability of using more than one latent vector.\n\n2 Proposed Model\n\nSuppose that we are given N instances with D views X = \{X_n\}_{n=1}^{N}, where X_n = \{x_{nd}\}_{d=1}^{D} is a set of multi-view observation vectors for the nth instance, and x_{nd} \in \mathbb{R}^{M_d} is the observation vector of the dth view. The task is to find anomalous instances that have inconsistent observation features across multiple views. We propose a probabilistic latent variable model for this task. The proposed model assumes that each instance potentially has a countably infinite number of latent vectors Z_n = \{z_{nj}\}_{j=1}^{\infty}, where z_{nj} \in \mathbb{R}^{K}. Each view of an instance x_{nd} is generated depending on a view-specific projection matrix W_d \in \mathbb{R}^{M_d \times K} and a latent vector z_{n s_{nd}} that is selected from the set of latent vectors Z_n. Here, s_{nd} \in \{1, \cdots, \infty\} is the latent vector assignment of x_{nd}. When the instance is non-anomalous and all its views are consistent, all of the views are generated from a single latent vector. In other words, the latent vector assignments for all views are the same, s_{n1} = s_{n2} = \cdots = s_{nD}. 
When it is an anomaly and some views are inconsistent, different views are generated from different latent vectors, and some latent vector assignments are different, i.e. s_{nd} \neq s_{nd'} for some d \neq d'.\n\nSpecifically, the proposed model is an infinite mixture model, where the probability for the dth view of the nth instance is given by\n\np(x_{nd} | Z_n, W_d, \theta_n, \alpha) = \sum_{j=1}^{\infty} \theta_{nj} \mathcal{N}(x_{nd} | W_d z_{nj}, \alpha^{-1} I),   (1)\n\nwhere \theta_n = \{\theta_{nj}\}_{j=1}^{\infty} are the mixture weights, \theta_{nj} represents the probability of choosing the jth latent vector, \alpha is a precision parameter, \mathcal{N}(\mu, \Sigma) denotes the Gaussian distribution with mean \mu and covariance matrix \Sigma, and I is the identity matrix. Information of non-anomalous instances that cannot be handled by a single latent vector is modeled as Gaussian noise, which is controlled by \alpha. Since we assume the same observation noise \alpha across different views, the observations need to be normalized. We use a Dirichlet process for the prior of the mixture weights \theta_n. Its use enables us to automatically infer the number of latent vectors for each instance from the given data.\n\nThe complete generative process of the proposed model for multi-view instances X is as follows:\n\n1. Draw a precision parameter \alpha ~ Gamma(a, b)\n2. For each instance: n = 1, . . . , N\n(a) Draw mixture weights \theta_n ~ Stick(\gamma)\n(b) For each latent vector: j = 1, . . . , \infty\ni. Draw a latent vector z_{nj} ~ \mathcal{N}(0, (\alpha r)^{-1} I)\n(c) For each view: d = 1, . . . , D\ni. Draw a latent vector assignment s_{nd} ~ Discrete(\theta_n)\nii. Draw an observation vector x_{nd} ~ \mathcal{N}(W_d z_{n s_{nd}}, \alpha^{-1} I)\n\nHere, Stick(\gamma) is the stick-breaking process [19] that generates mixture weights for a Dirichlet process with concentration parameter \gamma, and r is the relative precision for latent vectors. 
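As a concrete illustration, the generative process above can be sketched with a truncated stick-breaking approximation to the Dirichlet process. This is a minimal sketch, not the paper's code; the truncation level J, the random projection matrices, and all function names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(N, M_d, K, J=10, a=1.0, b=1.0, r=1.0, gamma=1.0):
    """Sample multi-view data following the generative process above,
    with the Dirichlet process truncated at J latent vectors per instance."""
    D = len(M_d)
    alpha = rng.gamma(a, 1.0 / b)                       # alpha ~ Gamma(a, b) with rate b
    W = [rng.normal(size=(M, K)) for M in M_d]          # projection matrices (illustrative; no prior in the model)
    X, S = [], []
    for n in range(N):
        v = rng.beta(1.0, gamma, size=J)                # stick-breaking proportions
        theta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
        theta /= theta.sum()                            # renormalize after truncation
        Z = rng.normal(scale=1.0 / np.sqrt(alpha * r), size=(J, K))  # z_nj ~ N(0, (alpha r)^{-1} I)
        s = rng.choice(J, size=D, p=theta)              # assignments s_nd ~ Discrete(theta_n)
        X.append([rng.normal(W[d] @ Z[s[d]], 1.0 / np.sqrt(alpha)) for d in range(D)])
        S.append(s)
    return X, S

X, S = generate(N=100, M_d=[5, 8], K=3)
```

A non-anomalous instance corresponds to all entries of s_n being equal; an anomalous one draws two or more distinct latent vectors for its views.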
\alpha is shared between the observation and latent vector precisions because this makes it possible to analytically integrate out \alpha as shown in (4). Figure 1 (b) shows a graphical model representation of the proposed model, where the shaded and unshaded nodes indicate observed and latent variables, respectively.\n\nThe joint probability of the data X and the latent vector assignments S = \{\{s_{nd}\}_{d=1}^{D}\}_{n=1}^{N} is given by\n\np(X, S | W, a, b, r, \gamma) = p(S | \gamma) p(X | S, W, a, b, r),   (2)\n\nwhere W = \{W_d\}_{d=1}^{D}. Because we use conjugate priors, we can analytically integrate out the mixture weights \Theta = \{\theta_n\}_{n=1}^{N}, the latent vectors Z, and the precision parameter \alpha. Here, we use a Dirichlet process prior for the multinomial parameter \theta_n, and a Gaussian-Gamma prior for the latent vector z_{nj}. By integrating out the mixture weights \Theta, the first factor is calculated by\n\np(S | \gamma) = \prod_{n=1}^{N} \frac{\gamma^{J_n} \prod_{j=1}^{J_n} (N_{nj} - 1)!}{\gamma (\gamma + 1) \cdots (\gamma + D - 1)},   (3)\n\nwhere N_{nj} represents the number of views assigned to the jth latent vector in the nth instance, and J_n is the number of latent vectors of the nth instance for which N_{nj} > 0. 
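Equation (3) is the exchangeable partition probability of a Chinese restaurant process applied to each instance's D view assignments. A minimal sketch of its log value (the function name and interface are ours, not from the paper):

```python
import math
from collections import Counter

def log_partition_prior(S, gamma):
    """log p(S | gamma) from Eq. (3).

    S: list of per-instance assignment lists s_n = [s_n1, ..., s_nD].
    Each instance contributes gamma^{J_n} * prod_j (N_nj - 1)!
    divided by gamma (gamma + 1) ... (gamma + D - 1).
    """
    logp = 0.0
    for s_n in S:
        counts = Counter(s_n)                                      # N_nj for latent vectors with N_nj > 0
        logp += len(counts) * math.log(gamma)                      # gamma^{J_n}
        logp += sum(math.lgamma(c) for c in counts.values())       # log (N_nj - 1)!
        logp -= sum(math.log(gamma + i) for i in range(len(s_n)))  # ascending factorial in the denominator
    return logp
```

For D = 2 views and gamma = 1, the two possible partitions (both views sharing one latent vector, or each view having its own) each receive prior probability 1/2, so the prior alone does not favor declaring an instance anomalous.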
By integrating out the latent vectors Z and the precision parameter \alpha, the second factor of (2) is calculated by\n\np(X | S, W, a, b, r) = (2\pi)^{-\frac{N \sum_d M_d}{2}} r^{\frac{K \sum_n J_n}{2}} \frac{b^a}{b'^{a'}} \frac{\Gamma(a')}{\Gamma(a)} \prod_{n=1}^{N} \prod_{j=1}^{J_n} |C_{nj}|^{\frac{1}{2}},   (4)\n\nwhere\n\na' = a + \frac{N \sum_{d=1}^{D} M_d}{2},   b' = b + \frac{1}{2} \sum_{n=1}^{N} \sum_{d=1}^{D} x_{nd}^\top x_{nd} - \frac{1}{2} \sum_{n=1}^{N} \sum_{j=1}^{J_n} \mu_{nj}^\top C_{nj}^{-1} \mu_{nj},   (5)\n\n\mu_{nj} = C_{nj} \sum_{d: s_{nd}=j} W_d^\top x_{nd},   C_{nj}^{-1} = \sum_{d: s_{nd}=j} W_d^\top W_d + r I.   (6)\n\nThe posterior for the precision parameter \alpha and that for the latent vector z_{nj} are given by\n\np(\alpha | X, S, W, a, b) = Gamma(a', b'),   p(z_{nj} | X, S, W, r) = \mathcal{N}(\mu_{nj}, \alpha^{-1} C_{nj}),   (7)\n\nrespectively.\n\n3 Inference\n\nWe describe inference procedures for the proposed model based on a stochastic EM algorithm, in which collapsed Gibbs sampling of the latent vector assignments S and maximum joint likelihood estimation of the projection matrices W are alternately iterated while analytically integrating out the latent vectors Z, mixture weights \Theta and precision parameter \alpha. By integrating out the latent vectors, we do not need to explicitly infer them, leading to a robust and fast-mixing inference.\n\nLet \ell = (n, d) be the index of the dth view of the nth instance for notational convenience. In the E-step, given the current state of all but one latent assignment s_\ell, a new value for s_\ell is sampled from \{1, \cdots, J_{n \backslash \ell} + 1\} according to the following probability,\n\np(s_\ell = j | X, S_{\backslash \ell}, W, a, b, r, \gamma) \propto \frac{p(s_\ell = j, S_{\backslash \ell} | \gamma)}{p(S_{\backslash \ell} | \gamma)} \cdot \frac{p(X | s_\ell = j, S_{\backslash \ell}, W, a, b, r)}{p(X_{\backslash \ell} | S_{\backslash \ell}, W, a, b, r)},   (8)\n\nwhere the subscript \backslash \ell represents a value or set excluding the dth view of the nth instance. 
The first factor is given by\n\n\frac{p(s_\ell = j, S_{\backslash \ell} | \gamma)}{p(S_{\backslash \ell} | \gamma)} = \begin{cases} \frac{N_{nj \backslash \ell}}{D - 1 + \gamma} & \text{if } j \le J_{n \backslash \ell} \\ \frac{\gamma}{D - 1 + \gamma} & \text{if } j = J_{n \backslash \ell} + 1, \end{cases}   (9)\n\nusing (3), where j \le J_{n \backslash \ell} is for existing latent vectors, and j = J_{n \backslash \ell} + 1 is for a new latent vector. By using (4), the second factor is given by\n\n\frac{p(X | s_\ell = j, S_{\backslash \ell}, W, a, b, r)}{p(X_{\backslash \ell} | S_{\backslash \ell}, W, a, b, r)} = (2\pi)^{-\frac{M_d}{2}} r^{I(j = J_{n \backslash \ell} + 1) \frac{K}{2}} \frac{(b'_{\backslash \ell})^{a'_{\backslash \ell}}}{(b'_{s_\ell = j})^{a'_{s_\ell = j}}} \frac{\Gamma(a'_{s_\ell = j})}{\Gamma(a'_{\backslash \ell})} \frac{|C_{nj, s_\ell = j}|^{\frac{1}{2}}}{|C_{nj \backslash \ell}|^{\frac{1}{2}}},   (10)\n\nwhere I(\cdot) represents the indicator function, i.e. I(A) = 1 if A is true and 0 otherwise, and the subscript s_\ell = j indicates the value when x_\ell is assigned to the jth latent vector as follows,\n\nb'_{s_\ell = j} = b'_{\backslash \ell} + \frac{1}{2} x_\ell^\top x_\ell + \frac{1}{2} \mu_{nj \backslash \ell}^\top C_{nj \backslash \ell}^{-1} \mu_{nj \backslash \ell} - \frac{1}{2} \mu_{nj, s_\ell = j}^\top C_{nj, s_\ell = j}^{-1} \mu_{nj, s_\ell = j},   (11)\n\na'_{s_\ell = j} = a',   \mu_{nj, s_\ell = j} = C_{nj, s_\ell = j} (W_d^\top x_\ell + C_{nj \backslash \ell}^{-1} \mu_{nj \backslash \ell}),   (12)\n\nC_{nj, s_\ell = j}^{-1} = W_d^\top W_d + C_{nj \backslash \ell}^{-1}.   (13)\n\nIntuitively, if the current view cannot be modeled well by existing latent vectors, a new latent vector is used, which indicates that the view is inconsistent with the other views.\n\nIn the M-step, the projection matrices W are estimated by maximizing the logarithm of the joint likelihood (2) while fixing the cluster assignment variables S. 
By setting the gradient of the joint log likelihood with respect to W_d equal to zero, an estimate of W_d for each view is obtained as follows,\n\nW_d = \Big( \frac{a'}{b'} \sum_{n=1}^{N} x_{nd} \mu_{n s_{nd}}^\top \Big) \Big( \sum_{n=1}^{N} \big( C_{n s_{nd}} + \frac{a'}{b'} \mu_{n s_{nd}} \mu_{n s_{nd}}^\top \big) \Big)^{-1}.   (14)\n\nWhen we iterate the E-step, which samples the latent vector assignment s_{nd} by employing (8) for each view d = 1, . . . , D of each instance n = 1, . . . , N, and the M-step, which maximizes the joint likelihood using (14) with respect to the projection matrix W_d for each view d = 1, . . . , D, we obtain an estimate of the latent vector assignments and projection matrices.\n\nIn Section 2, we defined that an instance is an anomaly when its different views are generated from different latent vectors. Therefore, for an anomaly score, we use the probability that the instance uses more than one latent vector. It is estimated by using the samples obtained in the inference as follows,\n\nv_n = \frac{1}{H} \sum_{h=1}^{H} I(J_n^{(h)} > 1),\n\nwhere J_n^{(h)} is the number of latent vectors used by the nth instance in the hth iteration of the Gibbs sampling after the burn-in period, and H is the number of iterations. The output of the proposed method is a ranked list of anomalies based on their anomaly scores. An analyst would investigate the top few anomalies, or use a threshold to select the anomalies [6]. The threshold can be determined based on a target false alarm rate and detection rate.\n\nWe can use cross-validation to select an appropriate dimensionality for the latent space K. With cross-validation, we assume that some features are missing from the given data, and infer the model with a different K. 
Then, we select the smallest K value that performs best at predicting the missing values.\n\n4 Related Work\n\nAnomaly detection has a wide variety of applications, such as credit card fraud detection [1], intrusion detection for network security [17], and analysis of healthcare data [3]. However, most existing anomaly detection techniques assume data with a single view, i.e. a single observation feature set.\n\nA number of anomaly detection methods for two-view data have been proposed [12, 20\u201322, 24]. However, they cannot be used for data with more than two views. Gao et al. [13] proposed a HOrizontal Anomaly Detection algorithm (HOAD) for finding anomalies in multi-view data. HOAD has hyperparameters, including a weight for its embedding constraint, whose tuning requires the data to be labeled as anomalous or not, and its performance is sensitive to these hyperparameters. On the other hand, the parameters of the proposed model can be estimated from the given multi-view data without label information by maximizing the likelihood. In addition, because the proposed model is a probabilistic generative model, we can extend it in a probabilistically principled manner, for example, for handling missing data or combining it with other probabilistic models.\n\nLiu and Lam [16] proposed multi-view anomaly detection methods using consensus clustering. They found anomalies based on the inconsistency of clustering results across multiple views. Therefore, they cannot find inconsistency within a cluster. Christoudias et al. [8] proposed a method for filtering instances that are corrupted by background noise from multi-view data. The multi-view anomalies considered in this paper include not only instances corrupted by background noise but also instances categorized into different foreground classes across views, and instances with inconsistent views even if they belong to the same cluster. Recently, Alvarez et al. 
[2] proposed a multi-view anomaly detection method. However, since the method is based on clustering, it cannot find anomalies when there are no clusters in the given data.\n\nThe proposed model is a generalization of both probabilistic principal component analysis (PPCA) [23] and probabilistic canonical correlation analysis (PCCA) [5]. When all views are generated from different latent vectors for every instance, the proposed model corresponds to PPCA performed independently for each view. When all views are generated from a single latent vector for every instance, the proposed model corresponds to PCCA with spherical noise.\n\nPCCA, or canonical correlation analysis (CCA), can be used for multi-view anomaly detection. With PCCA, a latent vector that is shared by all views of each instance and a linear projection matrix for each view are estimated by maximizing the likelihood, or minimizing the reconstruction error of the given data. The reconstruction error for each instance can be used as an anomaly score. However, the reconstruction errors are not reliable because they are calculated from parameters that are estimated, under the assumption that all of the instances are non-anomalous, from data that contain anomalies. On the other hand, because the proposed model simultaneously estimates the parameters and infers anomalies, the estimated parameters are not contaminated by the anomalies. With PPCA and PCCA, Gaussian distributions are used for the observation noise, which are sensitive to atypical observations. Robust PPCA and PCCA [4] use Student-t distributions instead of Gaussian distributions, which makes them robust to data containing single-view anomalies. The proposed model assumes Gaussian observation noise, and its precision is parameterized by a Gamma-distributed variable \u03b1. Since we marginalize out \u03b1 in the inference as shown in (4), the observation noise becomes a Student-t distribution. 
Therefore, the proposed model is robust to single-view anomalies.\n\nWith some CCA-related methods, each latent vector is factorized into shared and private components across different views [10]. These methods assume that every instance has shared and private parts of the same dimensionality for all instances. In contrast, the proposed model assumes that non-anomalous instances have only shared latent vectors, and anomalies have private latent vectors. The proposed model can be seen as CCA with private latent vectors, where the latent vectors across views are clustered for each instance. When CCA with private latent vectors is inferred without clustering, the inferred private latent vectors do not become the same even if the views are generated from a single latent vector, because switching latent dimensions or rotating the latent space does not change the likelihood. Therefore, differences between the latent vectors cannot be used for multi-view anomaly detection.\n\n5 Experiments\n\nData We evaluated the proposed model quantitatively by using 11 data sets, which we obtained from the LIBSVM data sets [7]. We generated two views by randomly splitting the features, where each feature can belong to only a single view, and anomalies were added by swapping the views of two randomly selected instances, regardless of their class labels, for each view. Splitting the data does not generate anomalies. Therefore, we can evaluate methods while controlling the anomaly rate properly. By swapping, although single-view anomalies cannot be created since the distribution of each view does not change, multi-view anomalies are created.\n\nComparing methods We compared the proposed model with probabilistic canonical correlation analysis (PCCA), horizontal anomaly detection (HOAD) [13], consensus clustering based anomaly detection (CC) [16], and the one-class support vector machine (OCSVM) [18]. 
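The view-swapping scheme described in the Data paragraph above can be sketched as follows. This is our illustrative interface, not the paper's code; the function name and return convention are assumptions.

```python
import numpy as np

def inject_swap_anomalies(X1, X2, anomaly_rate, rng=None):
    """Swap the second view between random pairs of instances to create multi-view anomalies.

    X1, X2: (N, M1) and (N, M2) arrays holding the two views.
    Returns the (possibly corrupted) views and a boolean anomaly-label vector.
    """
    rng = np.random.default_rng(rng)
    N = X1.shape[0]
    n_pairs = int(N * anomaly_rate) // 2          # each swap makes two anomalies
    idx = rng.permutation(N)[: 2 * n_pairs]
    a, b = idx[:n_pairs], idx[n_pairs:]
    X2 = X2.copy()
    X2[a], X2[b] = X2[b].copy(), X2[a].copy()     # pairwise swap of view 2
    labels = np.zeros(N, dtype=bool)
    labels[idx] = True
    return X1, X2, labels

X1 = np.random.default_rng(1).normal(size=(100, 6))
X2 = np.random.default_rng(2).normal(size=(100, 4))
X1c, X2c, labels = inject_swap_anomalies(X1, X2, anomaly_rate=0.3, rng=3)
```

Because each view's rows are only permuted, every per-view marginal distribution is unchanged, so the scheme introduces no single-view anomalies.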
For PCCA, we used the proposed model in which the number of latent vectors was fixed at one for every instance. The anomaly scores obtained with PCCA were calculated based on the reconstruction errors. HOAD requires selecting an appropriate hyperparameter value for controlling the constraints whereby different views of the same instance are embedded close together. We ran HOAD with different hyperparameter settings {0.1, 1, 10, 100}, and show the results that achieved the highest performance for each data set. For CC, we first clustered the instances for each view using spectral clustering. We set the number of clusters at 20, which achieved good performance in preliminary experiments. Then, we calculated anomaly scores by the likelihood of the consensus clustering when an instance was removed, since this indicates the inconsistency of the instance across different views. OCSVM is a representative method for single-view anomaly detection. To investigate the performance of a single-view method for multi-view anomaly detection, we included OCSVM as a comparison method. For OCSVM, the multiple views are concatenated into a single vector, which is then used as the input. We used a Gaussian kernel. In the proposed model, we used \u03b3 = 1, a = 1, and b = 1 for all experiments. The number of iterations for the Gibbs sampling was 500, and the anomaly score was calculated by averaging over the multiple samples.\n\nMulti-view anomaly detection As the evaluation measure, we used the area under the ROC curve (AUC). A higher AUC indicates higher anomaly detection performance. Figure 2 shows AUCs with different rates of anomalies using the 11 two-view data sets, averaged over 50 experiments. For the dimensionality of the latent space, we used K = 5 for the proposed model, PCCA, and HOAD. In general, as the anomaly rate increases, the performance decreases. The proposed model achieved the best performance on eight of the 11 data sets. 
This result indicates\nthat the proposed model can \ufb01nd anomalies effectively by inferring a number of latent vectors for\neach instance. The performance of CC was low because it assumes that there are clusters for each\nview, and it cannot \ufb01nd anomalies within clusters. The AUC of OCSVM was low, because it is a\nsingle-view anomaly detection method, which considers instances anomalous that are different from\nothers within a single view. Multi-view anomaly detection is the task to \ufb01nd instances that have\ninconsistent features across views, but not inconsistent features within a view. The computational\ntime needed for PCCA was 2 sec, and that needed for the proposed model was 35 sec with wine\ndata.\n\nFigure 3 shows AUCs with different dimensionalities of latent vectors using data sets whose anomaly\nrate is 0.4. When the dimensionality was very low (K = 1 or 2), the AUC was low in most of the data\nsets, because low-dimensional latent vectors cannot represent the observation vectors well. 
With all the methods, the AUCs were relatively stable when the latent dimensionality was higher than four.\n\n[Figure 2 here: eleven panels, (a) breast-cancer, (b) diabetes, (c) glass, (d) heart, (e) ionosphere, (f) sonar, (g) svmguide2, (h) svmguide4, (i) vehicle, (j) vowel, (k) wine, each plotting AUC against anomaly rate for Proposed, PCCA, HOAD, CC, and OCSVM.]\n\nFigure 2: Average AUCs with different anomaly rates, and their standard errors. 
A higher AUC is better.\n\n[Figure 3 here: eleven panels, (a) breast-cancer through (k) wine as in Figure 2, each plotting AUC against latent dimensionality for Proposed, PCCA, and HOAD.]\n\nFigure 3: Average AUCs with different dimensionalities of latent vectors, and their standard errors.\n\nSingle-view anomaly detection We would like to find multi-view anomalies, but would not like to detect single-view anomalies. We illustrated that the proposed model does not detect single-view anomalies using synthetic single-view anomaly data. With the synthetic data, latent vectors for single-view anomalies were generated from N(0, \u221a10 I), and those for non-anomalous instances were generated from N(0, I). Since each of the anomalies has only a single latent vector, it is not a multi-view anomaly. The numbers of anomalous and non-anomalous instances were 5 and 95, respectively. The dimensionalities of the observed and latent spaces were five and three, respectively.\n\nTable 1: Average AUCs for single-view anomaly detection.\n\nProposed: 0.117 \u00b1 0.098   PCCA: 0.174 \u00b1 0.095   OCSVM: 0.860 \u00b1 0.232\n\nTable 1 shows the average AUCs with the single-view anomaly data, which are averaged over 50 different data sets. The low AUC of the proposed model indicates that it does not consider single-view anomalies as anomalies. On the other hand, the AUC of the one-class SVM (OCSVM) was high because OCSVM is a single-view anomaly detection method, which leads to low multi-view anomaly detection performance.\n\nApplication to movie data As an application of multi-view anomaly detection, we analyzed the inconsistency between movie rating behavior and genre in the MovieLens data [14]. An instance corresponds to a movie, where the first view represents whether the movie is rated or not by each user, and the second view represents the movie genre. Both views consist of binary features, where some movies are categorized in multiple genres. We used 338 movies, 943 users and 19 genres. Table 2 shows high and low anomaly score movies when we analyzed the movie data by the proposed method with K = 5.\n\nTable 2: High and low anomaly score movies calculated by the proposed model.\n\nTitle | Score | Title | Score\nThe Full Monty | 0.98 | Star Trek VI | 0.04\nLiar Liar | 0.93 | Star Trek III | 0.04\nThe Professional | 0.91 | The Saint | 0.04\nMr. Holland\u2019s Opus | 0.88 | Heat | 0.03\nContact | 0.87 | Conspiracy Theory | 0.03\n\n\u2018The Full Monty\u2019 and \u2018Liar Liar\u2019 were categorized in the \u2018Comedy\u2019 genre. They are rated not only by users who like \u2018Comedy\u2019, but also by users who like \u2018Romance\u2019 and \u2018Action-Thriller\u2019. \u2018The Professional\u2019 was an anomaly because it was rated by two different user groups, where one group prefers \u2018Romance\u2019 and the other prefers \u2018Action\u2019. 
Since the \u2018Star Trek\u2019 series is typical Sci-Fi and is liked by a specific group of users, its anomaly score was low.\n\n6 Conclusion\n\nWe proposed a generative model approach for multi-view anomaly detection, which finds instances that have inconsistent views. In the experiments, we confirmed that the proposed model could perform much better than existing methods for detecting multi-view anomalies. There are several avenues that can be pursued as future work. Since the proposed model assumes the linearity of observations with respect to their latent vectors, it cannot find anomalies when different views are in a nonlinear relationship. We can relax this assumption by using Gaussian processes [15]. We can also relax the assumption that non-anomalous instances have the same latent vector across all views by introducing private latent vectors [10]. The proposed model assumes Gaussian observation noise. Our framework can be extended to binary or count data by using Bernoulli or Poisson distributions instead of Gaussians.\n\nAcknowledgments\n\nMY was supported by KAKENHI 16K16114.\n\nReferences\n\n[1] E. Aleskerov, B. Freisleben, and B. Rao. Cardwatch: A neural network based database mining system for credit card fraud detection. In Proceedings of the IEEE/IAFE Computational Intelligence for Financial Engineering, pages 220\u2013226, 1997.\n\n[2] A. M. Alvarez, M. Yamada, A. Kimura, and T. Iwata. Clustering-based anomaly detection in multi-view data. In Proceedings of the ACM International Conference on Information and Knowledge Management, CIKM, 2013.\n\n[3] M.-L. Antonie, O. R. Zaiane, and A. Coman. Application of data mining techniques for medical image classification. MDM/KDD, pages 94\u2013101, 2001.\n\n[4] C. Archambeau, N. Delannay, and M. Verleysen. Robust probabilistic projections. In Proceedings of the 23rd International Conference on Machine Learning, pages 33\u201340, 2006.\n\n[5] F. R. Bach and M. I. Jordan. 
A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.

[6] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[7] C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[8] C. M. Christoudias, R. Urtasun, and T. Darrell. Multi-view learning in the presence of view disagreement. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, UAI, 2008.

[9] K. Duh, C.-M. A. Yeung, T. Iwata, and M. Nagata. Managing information disparity in multilingual document collections. ACM Transactions on Speech and Language Processing (TSLP), 10(1):1, 2013.

[10] C. H. Ek, J. Rihan, P. H. Torr, G. Rogez, and N. D. Lawrence. Ambiguity modeling in latent spaces. In Machine Learning for Multimodal Interaction, pages 62-73. Springer, 2008.

[11] H. Fanaee-T and J. Gama. Tensor-based anomaly detection. Knowledge-Based Systems, 98:130-147, 2016.

[12] J. Gao, F. Liang, W. Fan, C. Wang, Y. Sun, and J. Han. On community outliers and their efficient detection in information networks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 813-822. ACM, 2010.

[13] J. Gao, W. Fan, D. Turaga, S. Parthasarathy, and J. Han. A spectral framework for detecting inconsistency across multi-source object relationships. In IEEE 11th International Conference on Data Mining (ICDM), pages 1050-1055. IEEE, 2011.

[14] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 230-237. ACM, 1999.

[15] N.
D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems, 16(3):329-336, 2004.

[16] A. Y. Liu and D. N. Lam. Using consensus clustering for multi-view anomaly detection. In 2012 IEEE Symposium on Security and Privacy Workshops (SPW), pages 117-124. IEEE, 2012.

[17] L. Portnoy, E. Eskin, and S. Stolfo. Intrusion detection with unlabeled data using clustering. In Proceedings of the ACM CCS Workshop on Data Mining Applied to Security, 2001.

[18] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443-1471, 2001.

[19] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.

[20] S. Shekhar, C.-T. Lu, and P. Zhang. Detecting graph-based spatial outliers. Intelligent Data Analysis, 6(5):451-468, 2002.

[21] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Transactions on Knowledge and Data Engineering, 19(5):631-645, 2007.

[22] J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In Proceedings of the 5th IEEE International Conference on Data Mining, pages 418-425. IEEE, 2005.

[23] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611-622, 1999.

[24] X. Wang and I. Davidson. Discovering contexts and contextual outliers using random walks in graphs. In Proceedings of the 9th IEEE International Conference on Data Mining, pages 1034-1039.
IEEE, 2009.