{"title": "Translating Embeddings for Modeling Multi-relational Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2787, "page_last": 2795, "abstract": "We consider the problem of embedding entities and relationships of multi-relational data in low-dimensional vector spaces. Our objective is to propose a canonical model which is easy to train, contains a reduced number of parameters and can scale up to very large databases. Hence, we propose, TransE, a method which models relationships by interpreting them as translations operating on the low-dimensional embeddings of the entities. Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples.", "full_text": "Translating Embeddings for Modeling\n\nMulti-relational Data\n\nAntoine Bordes, Nicolas Usunier, Alberto Garcia-Dur\u00b4an\n\nUniversit\u00b4e de Technologie de Compi`egne \u2013 CNRS\n\nHeudiasyc UMR 7253\nCompi`egne, France\n\n{bordesan, nusunier, agarciad}@utc.fr\n\nJason Weston, Oksana Yakhnenko\n\nGoogle\n\n111 8th avenue\n\nNew York, NY, USA\n\n{jweston, oksana}@google.com\n\nAbstract\n\nWe consider the problem of embedding entities and relationships of multi-\nrelational data in low-dimensional vector spaces. Our objective is to propose a\ncanonical model which is easy to train, contains a reduced number of parameters\nand can scale up to very large databases. Hence, we propose TransE, a method\nwhich models relationships by interpreting them as translations operating on the\nlow-dimensional embeddings of the entities. 
Despite its simplicity, this assumption proves to be powerful since extensive experiments show that TransE significantly outperforms state-of-the-art methods in link prediction on two knowledge bases. Besides, it can be successfully trained on a large scale data set with 1M entities, 25k relationships and more than 17M training samples.

1 Introduction

Multi-relational data refers to directed graphs whose nodes correspond to entities and whose edges, of the form (head, label, tail) (denoted (h, ℓ, t)), each indicate that there exists a relationship of name label between the entities head and tail. Models of multi-relational data play a pivotal role in many areas. Examples are social network analysis, where entities are members and edges (relationships) are friendship/social relationship links; recommender systems, where entities are users and products and relationships are buying, rating, reviewing or searching for a product; or knowledge bases (KBs) such as Freebase¹, Google Knowledge Graph² or GeneOntology³, where each entity of the KB represents an abstract concept or concrete entity of the world and relationships are predicates that represent facts involving two of them. Our work focuses on modeling multi-relational data from KBs (Wordnet [9] and Freebase [1] in this paper), with the goal of providing an efficient tool to complete them by automatically adding new facts, without requiring extra knowledge.

Modeling multi-relational data. In general, the modeling process boils down to extracting local or global connectivity patterns between entities, and prediction is performed by using these patterns to generalize the observed relationship between a specific entity and all others.
The notion of locality for a single relationship may be purely structural, such as the friend of my friend is my friend in social networks, but can also depend on the entities, such as those who liked Star Wars IV also liked Star Wars V, but they may or may not like Titanic. In contrast to single-relational data, where ad-hoc but simple modeling assumptions can be made after some descriptive analysis of the data, the difficulty of relational data is that the notion of locality may involve relationships and entities of different types at the same time, so that modeling multi-relational data requires more generic approaches that can choose the appropriate patterns considering all heterogeneous relationships at the same time.

¹ freebase.com
² google.com/insidesearch/features/search/knowledge.html
³ geneontology.org

Following the success of user/item clustering or matrix factorization techniques in collaborative filtering to represent non-trivial similarities between the connectivity patterns of entities in single-relational data, most existing methods for multi-relational data have been designed within the framework of relational learning from latent attributes, as pointed out by [6]; that is, by learning and operating on latent representations (or embeddings) of the constituents (entities and relationships). Starting from natural extensions of these approaches to the multi-relational domain, such as nonparametric Bayesian extensions of the stochastic blockmodel [7, 10, 17] and models based on tensor factorization [5] or collective matrix factorization [13, 11, 12], many of the most recent approaches have focused on increasing the expressivity and the universality of the model in either Bayesian clustering frameworks [15] or energy-based frameworks for learning embeddings of entities in low-dimensional spaces [3, 15, 2, 14].
The greater expressivity of these models comes at the expense of substantial increases in model complexity, which results in modeling assumptions that are hard to interpret, and in higher computational costs. Besides, such approaches are potentially subject either to overfitting, since proper regularization of such high-capacity models is hard to design, or to underfitting, due to the non-convex optimization problems with many local minima that need to be solved to train them. As a matter of fact, it was shown in [2] that a simpler model (linear instead of bilinear) achieves almost as good performance as the most expressive models on several multi-relational data sets with a relatively large number of different relationships. This suggests that even in complex and heterogeneous multi-relational domains, simple yet appropriate modeling assumptions can lead to better trade-offs between accuracy and scalability.

Relationships as translations in the embedding space. In this paper, we introduce TransE, an energy-based model for learning low-dimensional embeddings of entities. In TransE, relationships are represented as translations in the embedding space: if (h, ℓ, t) holds, then the embedding of the tail entity t should be close to the embedding of the head entity h plus some vector that depends on the relationship ℓ. Our approach relies on a reduced set of parameters as it learns only one low-dimensional vector for each entity and each relationship.

The main motivation behind our translation-based parameterization is that hierarchical relationships are extremely common in KBs and translations are the natural transformations for representing them. Indeed, considering the natural representation of trees (i.e., embeddings of the nodes in dimension 2, in which siblings are close to each other and nodes at a given height are organized on the x-axis), the parent-child relationship corresponds to a translation on the y-axis.
Since a null translation vector corresponds to an equivalence relationship between entities, the model can then represent the sibling relationship as well. Hence, we chose to use our parameter budget per relationship (one low-dimensional vector) to represent what we considered to be the key relationships in KBs. Another, secondary, motivation comes from the recent work of [8], in which the authors learn word embeddings from free text, and some 1-to-1 relationships between entities of different types, such as "capital of" between countries and cities, are (coincidentally rather than willingly) represented by the model as translations in the embedding space. This suggests that there may exist embedding spaces in which 1-to-1 relationships between entities of different types may, as well, be represented by translations. The intention of our model is to enforce such a structure of the embedding space.

Our experiments in Section 4 demonstrate that this new model, despite its simplicity and its architecture primarily designed for modeling hierarchies, ends up being powerful on most kinds of relationships, and can significantly outperform state-of-the-art methods in link prediction on real-world KBs. Besides, its light parameterization allows it to be successfully trained on a large scale split of Freebase containing 1M entities, 25k relationships and more than 17M training samples.

In the remainder of the paper, we describe our model in Section 2 and discuss its connections with related methods in Section 3. We detail an extensive experimental study on Wordnet and Freebase in Section 4, comparing TransE with many methods from the literature. We finally conclude by sketching some future work directions in Section 5.

Algorithm 1 Learning TransE
input Training set S = {(h, ℓ, t)}, entities and rel.
sets E and L, margin γ, embeddings dimension k.
1: initialize ℓ ← uniform(−6/√k, 6/√k) for each ℓ ∈ L
2:   ℓ ← ℓ / ‖ℓ‖ for each ℓ ∈ L
3:   e ← uniform(−6/√k, 6/√k) for each entity e ∈ E
4: loop
5:   e ← e / ‖e‖ for each entity e ∈ E
6:   Sbatch ← sample(S, b) // sample a minibatch of size b
7:   Tbatch ← ∅ // initialize the set of pairs of triplets
8:   for (h, ℓ, t) ∈ Sbatch do
9:     (h′, ℓ, t′) ← sample(S′(h,ℓ,t)) // sample a corrupted triplet
10:    Tbatch ← Tbatch ∪ {((h, ℓ, t), (h′, ℓ, t′))}
11:  end for
12:  Update embeddings w.r.t. Σ_{((h,ℓ,t),(h′,ℓ,t′)) ∈ Tbatch} ∇[γ + d(h + ℓ, t) − d(h′ + ℓ, t′)]+
13: end loop

2 Translation-based model

Given a training set S of triplets (h, ℓ, t) composed of two entities h, t ∈ E (the set of entities) and a relationship ℓ ∈ L (the set of relationships), our model learns vector embeddings of the entities and the relationships. The embeddings take values in R^k (k is a model hyperparameter) and are denoted with the same letters, in boldface characters. The basic idea behind our model is that the functional relation induced by the ℓ-labeled edges corresponds to a translation of the embeddings, i.e. we want that h + ℓ ≈ t when (h, ℓ, t) holds (t should be a nearest neighbor of h + ℓ), while h + ℓ should be far away from t otherwise.
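As a toy numerical illustration of the h + ℓ ≈ t principle (our own sketch with made-up 2-d vectors, not learned embeddings; the dissimilarity d is introduced formally just below):

```python
import numpy as np

# Made-up 2-d embeddings, for illustration only.
h = np.array([0.6, 0.8])          # head entity
l = np.array([0.1, -0.3])         # relationship vector
t_true = np.array([0.7, 0.5])     # tail for which (h, l, t) holds
t_wrong = np.array([-0.9, 0.2])   # an unrelated entity

def dissimilarity(h, l, t, ord=2):
    """d(h + l, t): L1 or L2 distance between the translated head and the tail."""
    return np.linalg.norm(h + l - t, ord=ord)

# The true tail is a nearest neighbor of h + l; the wrong one is far away.
assert dissimilarity(h, l, t_true) < dissimilarity(h, l, t_wrong)
```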
Following an energy-based framework, the energy of a triplet is equal to d(h + ℓ, t) for some dissimilarity measure d, which we take to be either the L1 or the L2-norm. To learn such embeddings, we minimize a margin-based ranking criterion over the training set:

L = Σ_{(h,ℓ,t) ∈ S} Σ_{(h′,ℓ,t′) ∈ S′(h,ℓ,t)} [γ + d(h + ℓ, t) − d(h′ + ℓ, t′)]+    (1)

where [x]+ denotes the positive part of x, γ > 0 is a margin hyperparameter, and

S′(h,ℓ,t) = {(h′, ℓ, t) | h′ ∈ E} ∪ {(h, ℓ, t′) | t′ ∈ E}.    (2)

The set of corrupted triplets, constructed according to Equation (2), is composed of training triplets with either the head or the tail replaced by a random entity (but not both at the same time). The loss function (1) favors lower values of the energy for training triplets than for corrupted triplets, and is thus a natural implementation of the intended criterion. Note that for a given entity, its embedding vector is the same when the entity appears as the head or as the tail of a triplet.

The optimization is carried out by stochastic gradient descent (in minibatch mode), over the possible h, ℓ and t, with the additional constraint that the L2-norm of the embeddings of the entities is 1 (no regularization or norm constraints are given to the label embeddings ℓ). This constraint is important for our model, as it is for previous embedding-based methods [3, 6, 2], because it prevents the training process from trivially minimizing L by artificially increasing entity embedding norms.

The detailed optimization procedure is described in Algorithm 1. All embeddings for entities and relationships are first initialized following the random procedure proposed in [4].
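The whole procedure (Algorithm 1 above) can be sketched in a few lines of NumPy. This is our own toy reimplementation, not the authors' code: the sizes, random triplets and learning rate are illustrative, and we use the squared-L2 dissimilarity so the gradients are simple:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, k = 20, 3, 10          # toy sizes (illustrative)
gamma, lr, b = 1.0, 0.01, 8          # margin, learning rate, minibatch size

# Initialization as in [4]: uniform(-6/sqrt(k), 6/sqrt(k)); relations normalized once.
E = rng.uniform(-6 / np.sqrt(k), 6 / np.sqrt(k), (n_ent, k))   # entity embeddings
L = rng.uniform(-6 / np.sqrt(k), 6 / np.sqrt(k), (n_rel, k))   # relation embeddings
L /= np.linalg.norm(L, axis=1, keepdims=True)

# A toy training set of (head, label, tail) index triplets.
S = [(rng.integers(n_ent), rng.integers(n_rel), rng.integers(n_ent)) for _ in range(100)]

for step in range(200):
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # project entities onto the unit sphere
    batch = [S[i] for i in rng.choice(len(S), b)]
    for h, l, t in batch:
        # Corrupt either the head or the tail (not both at the same time).
        h2, t2 = (rng.integers(n_ent), t) if rng.random() < 0.5 else (h, rng.integers(n_ent))
        d_pos = E[h] + L[l] - E[t]
        d_neg = E[h2] + L[l] - E[t2]
        loss = gamma + np.sum(d_pos**2) - np.sum(d_neg**2)   # squared-L2 margin loss
        if loss > 0:   # [x]+ : only violated pairs produce a gradient
            E[h] -= lr * 2 * d_pos
            E[t] += lr * 2 * d_pos
            L[l] -= lr * 2 * (d_pos - d_neg)
            E[h2] += lr * 2 * d_neg
            E[t2] -= lr * 2 * d_neg
```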
At each main iteration of the algorithm, the embedding vectors of the entities are first normalized. Then, a small set of triplets is sampled from the training set, and will serve as the training triplets of the minibatch. For each such triplet, we then sample a single corrupted triplet. The parameters are then updated by taking a gradient step with constant learning rate. The algorithm is stopped based on its performance on a validation set.

3 Related work

Section 1 described a large body of work on embedding KBs. We detail here the links between our model and those of [3] (Structured Embeddings or SE) and [14].

Table 1: Numbers of parameters and their values for FB15k (in millions). ne and nr are the nb. of entities and relationships; k the embeddings dimension.

METHOD              NB. OF PARAMETERS        ON FB15K
Unstructured [2]    O(ne k)                  0.75
RESCAL [11]         O(ne k + nr k²)          87.80
SE [3]              O(ne k + 2 nr k²)        7.47
SME(LINEAR) [2]     O(ne k + nr k + 4k²)     0.82
SME(BILINEAR) [2]   O(ne k + nr k + 2k³)     1.06
LFM [6]             O(ne k + nr k + 10k²)    0.84
TransE              O(ne k + nr k)           0.81

Table 2: Statistics of the data sets used in this paper and extracted from the two knowledge bases, Wordnet and Freebase.

DATA SET        WN        FB15K     FB1M
ENTITIES        40,943    14,951    1×10⁶
RELATIONSHIPS   18        1,345     23,382
TRAIN. EX.      141,442   483,142   17.5×10⁶
VALID EX.       5,000     50,000    50,000
TEST EX.        5,000     59,071    177,404

SE [3] embeds entities into R^k, and relationships into two matrices L1 ∈ R^{k×k} and L2 ∈ R^{k×k} such that d(L1 h, L2 t) is large for corrupted triplets (h, ℓ, t) (and small otherwise). The basic idea is that when two entities belong to the same triplet, their embeddings should be close to each other in some subspace that depends on the relationship. Using two different projection matrices for the head and for the tail is intended to account for the possible asymmetry of relationship ℓ.
When the dissimilarity function takes the form d(x, y) = g(x − y) for some g: R^k → R (e.g. g is a norm), then SE with an embedding of size k + 1 is strictly more expressive than our model with an embedding of size k, since linear operators in dimension k + 1 can reproduce affine transformations in a subspace of dimension k (by constraining the (k+1)-th dimension of all embeddings to be equal to 1). SE, with L2 set to the identity matrix and L1 taken so as to reproduce a translation, is then equivalent to TransE. Despite the lower expressiveness of our model, we still reach better performance than SE in our experiments. We believe this is because (1) our model is a more direct way to represent the true properties of the relationship, and (2) optimization is difficult in embedding models. For SE, greater expressiveness seems to lead to underfitting rather than to better performance. Training errors (in Section 4.3) tend to confirm this point.

Another related approach is the Neural Tensor Model [14].
A special case of this model corresponds to learning scores s(h, ℓ, t) (lower scores for corrupted triplets) of the form:

s(h, ℓ, t) = hᵀ L t + ℓ1ᵀ h + ℓ2ᵀ t    (3)

where L ∈ R^{k×k}, ℓ1 ∈ R^k and ℓ2 ∈ R^k, all of them depending on ℓ.

If we consider TransE with the squared Euclidean distance as dissimilarity function, we have:

d(h + ℓ, t) = ‖h‖²₂ + ‖ℓ‖²₂ + ‖t‖²₂ − 2(hᵀ t + ℓᵀ(t − h)).

Considering our norm constraints (‖h‖²₂ = ‖t‖²₂ = 1) and the ranking criterion (1), in which ‖ℓ‖²₂ does not play any role in comparing corrupted triplets, our model thus involves scoring the triplets with hᵀ t + ℓᵀ(t − h), and hence corresponds to the model of [14] (Equation (3)) where L is the identity matrix and ℓ = ℓ1 = −ℓ2. We could not run experiments with this model (since it was published simultaneously with ours), but once again TransE has much fewer parameters: this could simplify the training, prevent underfitting, and may compensate for a lower expressiveness.

Nevertheless, the simple formulation of TransE, which can be seen as encoding a series of 2-way interactions (e.g. by developing the L2 version), involves drawbacks. For modeling data where 3-way dependencies between h, ℓ and t are crucial, our model can fail. For instance, on the small-scale Kinships data set [7], TransE does not achieve performance in cross-validation (measured with the area under the precision-recall curve) competitive with the state of the art [11, 6], because such ternary interactions are crucial in this case (see discussion in [2]).
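The squared-L2 expansion above is easy to check numerically (a quick sanity check of ours with random vectors, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
h, l, t = rng.normal(size=k), rng.normal(size=k), rng.normal(size=k)

lhs = np.sum((h + l - t) ** 2)                       # d(h + l, t) with d the squared L2 distance
rhs = (np.sum(h**2) + np.sum(l**2) + np.sum(t**2)
       - 2 * (h @ t + l @ (t - h)))                  # the expanded form
assert np.isclose(lhs, rhs)
```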
Still, our experiments in Section 4 demonstrate that, for handling generic large-scale KBs like Freebase, one should first model properly the most frequent connectivity patterns, as TransE does.

4 Experiments

Our approach, TransE, is evaluated on data extracted from Wordnet and Freebase (their statistics are given in Table 2), against several recent methods from the literature which were shown to achieve the best current performance on various benchmarks and to scale to relatively large data sets.

4.1 Data sets

Wordnet. This KB is designed to produce an intuitively usable dictionary and thesaurus, and to support automatic text analysis. Its entities (termed synsets) correspond to word senses, and relationships define lexical relations between them. We considered the data version used in [2], which we denote WN in the following. Examples of triplets are (score_NN_1, hypernym, evaluation_NN_1) or (score_NN_2, has_part, musical_notation_NN_1).⁴

Freebase. Freebase is a huge and growing KB of general facts; there are currently around 1.2 billion triplets and more than 80 million entities. We created two data sets with Freebase. First, to make a small data set to experiment on, we selected the subset of entities that are also present in the Wikilinks database⁵ and that also have at least 100 mentions in Freebase (for both entities and relationships). We also removed relationships like '!/people/person/nationality' which just reverses the head and tail compared to the relationship '/people/person/nationality'. This resulted in 592,213 triplets with 14,951 entities and 1,345 relationships, which were randomly split as shown in Table 2. This data set is denoted FB15k in the rest of this section. We also wanted to have large-scale data in order to test TransE at scale. Hence, we created another data set from Freebase, by selecting the most frequently occurring 1 million entities.
This led to a split with around 25k relationships and more than 17 million training triplets, which we refer to as FB1M.

4.2 Experimental setup

Evaluation protocol. For evaluation, we use the same ranking procedure as in [3]. For each test triplet, the head is removed and replaced by each of the entities of the dictionary in turn. Dissimilarities (or energies) of those corrupted triplets are first computed by the models and then sorted in ascending order; the rank of the correct entity is finally stored. This whole procedure is repeated while removing the tail instead of the head. We report the mean of those predicted ranks and the hits@10, i.e. the proportion of correct entities ranked in the top 10.

These metrics are indicative but can be flawed when some corrupted triplets end up being valid ones, from the training set for instance. In this case, those may be ranked above the test triplet, but this should not be counted as an error because both triplets are true. To avoid such misleading behavior, we propose to remove from the list of corrupted triplets all the triplets that appear either in the training, validation or test set (except the test triplet of interest). This ensures that no corrupted triplet belongs to the data set. In the following, we report mean ranks and hits@10 according to both settings: the original (possibly flawed) one is termed raw, while we refer to the newer one as filtered (or filt.). We only provide raw results for experiments on FB1M.

Baselines. The first method is Unstructured, a version of TransE which considers the data as mono-relational and sets all translations to 0 (it was already used as a baseline in [2]). We also compare with RESCAL, the collective matrix factorization model presented in [11, 12], and the energy-based models SE [3], SME(linear)/SME(bilinear) [2] and LFM [6].
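The raw and filtered ranking protocols described above can be sketched as follows (our own illustration; the L2 scoring and the toy entities are stand-ins, not the paper's data):

```python
import numpy as np

def rank_of_head(E, L, triplet, all_triplets=frozenset(), filtered=False):
    """Rank of the true head when every candidate head is scored by d(h + l, t).
    In the filtered setting, corrupted triplets that are themselves true are skipped."""
    h, l, t = triplet
    scores = np.linalg.norm(E + L[l] - E[t], axis=1)   # dissimilarity for every candidate head
    order = np.argsort(scores)                         # ascending dissimilarity
    rank = 1
    for cand in order:
        if cand == h:
            return rank
        if filtered and (cand, l, t) in all_triplets:
            continue                                   # a true triplet: not counted as an error
        rank += 1

# Toy example: 4 entities, 1 relation; entity 1 is the true head of (1, 0, 2).
E = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 1.1], [0.0, -1.0]])
L = np.array([[0.1, 0.1]])
r = rank_of_head(E, L, (1, 0, 2))
hits_at_10 = r <= 10
```

The same routine, run with the tail replaced instead of the head and averaged over all test triplets, yields the mean rank and hits@10 reported below.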
RESCAL is trained via an alternating least-squares method, whereas the others are trained by stochastic gradient descent, like TransE. Table 1 compares the theoretical number of parameters of the baselines to our model, and gives the order of magnitude on FB15k. While SME(linear), SME(bilinear), LFM and TransE have about the same number of parameters as Unstructured for low-dimensional embeddings, the other algorithms, SE and RESCAL, which learn at least one k × k matrix for each relationship, rapidly need to learn many parameters. RESCAL needs about 87 times more parameters on FB15k because it requires a much larger embedding space than other models to achieve good performance. We did not experiment on FB1M with RESCAL, SME(bilinear) and LFM for scalability reasons, in terms of numbers of parameters or training duration.

We trained all baseline methods using the code provided by the authors. For RESCAL, we had to set the regularization parameter to 0 for scalability reasons, as indicated in [11], and chose the latent dimension k among {50, 250, 500, 1000, 2000} that led to the lowest mean predicted ranks on the validation sets (using the raw setting). For Unstructured, SE, SME(linear) and SME(bilinear), we

⁴ WN is composed of senses; its entities are denoted by the concatenation of a word, its part-of-speech tag and a digit indicating which sense it refers to, i.e. score_NN_1 encodes the first meaning of the noun "score".
⁵ code.google.com/p/wiki-links

Table 3: Link prediction results.
Test performance of the different methods.

DATASET                    WN                            FB15K                         FB1M
METRIC             MEAN RANK    HITS@10 (%)     MEAN RANK    HITS@10 (%)     MEAN RANK  HITS@10 (%)
Eval. setting      Raw   Filt.  Raw    Filt.    Raw    Filt. Raw    Filt.    Raw        Raw
Unstructured [2]   315   304    35.3   38.2     1,074  979   4.5    6.3      15,139     2.9
RESCAL [11]        1,180 1,163  37.2   52.8     828    683   28.4   44.1     -          -
SE [3]             1,011 985    68.5   80.5     273    162   28.8   39.8     22,044     17.5
SME(LINEAR) [2]    545   533    65.1   74.1     274    154   30.7   40.8     -          -
SME(BILINEAR) [2]  526   509    54.7   61.3     284    158   31.3   41.3     -          -
LFM [6]            469   456    71.4   81.6     283    164   26.0   33.1     -          -
TransE             263   251    75.4   89.2     243    125   34.9   47.1     14,615     34.0

selected the learning rate among {0.001, 0.01, 0.1}, k among {20, 50}, and selected the best model by early stopping using the mean rank on the validation sets (with a total of at most 1,000 epochs over the training data). For LFM, we also used the mean validation ranks to select the model and to choose the latent dimension among {25, 50, 75}, the number of factors among {50, 100, 200, 500} and the learning rate among {0.01, 0.1, 0.5}.

Implementation. For experiments with TransE, we selected the learning rate λ for the stochastic gradient descent among {0.001, 0.01, 0.1}, the margin γ among {1, 2, 10} and the latent dimension k among {20, 50} on the validation set of each data set. The dissimilarity measure d was set either to the L1 or L2 distance according to validation performance as well. Optimal configurations were: k = 20, λ = 0.01, γ = 2, and d = L1 on Wordnet; k = 50, λ = 0.01, γ = 1, and d = L1 on FB15k; k = 50, λ = 0.01, γ = 1, and d = L2 on FB1M. For all data sets, training time was limited to at most 1,000 epochs over the training set. The best models were selected by early stopping using the mean predicted ranks on the validation sets (raw setting).
An open-source implementation of TransE is available from the project webpage⁶.

4.3 Link prediction

Overall results. Table 3 displays the results on all data sets for all compared methods. As expected, the filtered setting provides lower mean ranks and higher hits@10, which we believe offer a clearer evaluation of the performance of the methods in link prediction. However, the trends between raw and filtered are generally the same.

Our method, TransE, outperforms all counterparts on all metrics, usually by a wide margin, and reaches some promising absolute performance scores such as 89% hits@10 on WN (over more than 40k entities) and 34% on FB1M (over 1M entities). All differences between TransE and the best runner-up methods are substantial.

We believe that the good performance of TransE is due to an appropriate design of the model according to the data, but also to its relative simplicity, which means that it can be optimized efficiently with stochastic gradient descent. We showed in Section 3 that SE is more expressive than our proposal. However, its complexity may make it quite hard to learn, resulting in worse performance. On FB15k, SE achieves a mean rank of 165 and hits@10 of 35.5% on a subset of 50k triplets of the training set, whereas TransE reaches 127 and 42.7%, indicating that TransE is indeed less subject to underfitting and that this could explain its better performance. SME(bilinear) and LFM suffer from the same training issue: we never managed to train them well enough to exploit their full capabilities. The poor results of LFM might also be explained by our evaluation setting, based on ranking entities, whereas LFM was originally proposed to predict relationships. RESCAL can achieve quite good hits@10 on FB15k but yields poor mean ranks, especially on WN, even when we used large latent dimensions (2,000 on Wordnet).

The impact of the translation term is huge.
When one compares the performance of TransE and Unstructured (i.e. TransE without translation), the mean ranks of Unstructured appear to be rather good (best runner-up on WN), but its hits@10 are very poor. Unstructured simply clusters all entities co-occurring together, independent of the relationships involved, and hence can only guess which entities are related. On FB1M, the mean ranks of TransE and Unstructured are almost similar, but TransE places 10 times more predictions in the top 10.

⁶ Available at http://goo.gl/0PpKQe.

Table 4: Detailed results by category of relationship. We compare hits@10 (in %) on FB15k in the filtered evaluation setting for our model, TransE, and baselines. (M. stands for MANY).

TASK                        PREDICTING head                      PREDICTING tail
REL. CATEGORY      1-TO-1  1-TO-M.  M.-TO-1  M.-TO-M.   1-TO-1  1-TO-M.  M.-TO-1  M.-TO-M.
Unstructured [2]   34.5    2.5      6.1      6.6        34.3    4.2      1.9      6.6
SE [3]             35.6    62.6     17.2     37.5       34.9    14.6     68.3     41.3
SME(LINEAR) [2]    35.1    53.7     19.0     40.3       32.7    14.9     61.6     43.3
SME(BILINEAR) [2]  30.9    69.6     19.9     38.6       28.2    13.1     76.0     41.8
TransE             43.7    65.7     18.2     47.2       43.7    19.7     66.7     50.0

Table 5: Example predictions on the FB15k test set using TransE. Bold indicates the test triplet's true tail and italics other true tails present in the training set.

INPUT (HEAD AND LABEL): J. K. Rowling influenced by
PREDICTED TAILS: G. K. Chesterton, J. R. R. Tolkien, C. S. Lewis, Lloyd Alexander, Terry Pratchett, Roald Dahl, Jorge Luis Borges, Stephen King, Ian Fleming

INPUT: Anthony LaPaglia performed in
PREDICTED TAILS: Lantana, Summer of Sam, Happy Feet, The House of Mirth, Unfaithful, Legend of the Guardians, Naked Lunch, X-Men, The Namesake

INPUT: Camden County adjoins
PREDICTED TAILS: Burlington County, Atlantic County, Gloucester County, Union County, Essex County, New Jersey, Passaic County, Ocean County, Bucks County

INPUT: The 40-Year-Old Virgin nominated for
PREDICTED TAILS: MTV Movie Award for Best Comedic Performance, BFCA Critics' Choice Award for Best Comedy, MTV Movie Award for Best On-Screen Duo, MTV Movie Award for Best Breakthrough Performance, MTV Movie Award for Best Movie, MTV Movie Award for Best Kiss, D. F. Zanuck Producer of the Year Award in Theatrical Motion Pictures, Screen Actors Guild Award for Best Actor - Motion Picture

INPUT: Costa Rica football team has position
PREDICTED TAILS: Forward, Defender, Midfielder, Goalkeepers, Pitchers, Infielder, Outfielder, Center, Defenseman

INPUT: Lil Wayne born in
PREDICTED TAILS: New Orleans, Atlanta, Austin, St. Louis, Toronto, New York City, Wellington, Dallas, Puerto Rico

INPUT: WALL-E has the genre
PREDICTED TAILS: Animations, Computer Animation, Comedy film, Adventure film, Science Fiction, Fantasy, Stop motion, Satire, Drama

Detailed results. Table 4 classifies the results (in hits@10) on FB15k depending on several categories of the relationships and on the argument to predict, for several of the methods. We categorized the relationships according to the cardinalities of their head and tail arguments into four classes: 1-TO-1, 1-TO-MANY, MANY-TO-1, MANY-TO-MANY. A given relationship is 1-TO-1 if a head can appear with at most one tail, 1-TO-MANY if a head can appear with many tails, MANY-TO-1 if many heads can appear with the same tail, and MANY-TO-MANY if multiple heads can appear with multiple tails. We classified the relationships into these four classes by computing, for each relationship ℓ, the average number of heads h (respectively tails t) appearing in the FB15k data set given a pair (ℓ, t) (respectively a pair (h, ℓ)).
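This categorization step can be sketched as follows (our own code over made-up triplets; the 1.5 cut-off is the one used below):

```python
from collections import defaultdict

def relationship_category(triplets, rel, threshold=1.5):
    """Classify rel as 1-TO-1 / 1-TO-MANY / MANY-TO-1 / MANY-TO-MANY from the
    average number of heads per (rel, tail) pair and tails per (head, rel) pair."""
    heads_per_tail, tails_per_head = defaultdict(set), defaultdict(set)
    for h, l, t in triplets:
        if l == rel:
            heads_per_tail[t].add(h)
            tails_per_head[h].add(t)
    avg_heads = sum(len(s) for s in heads_per_tail.values()) / len(heads_per_tail)
    avg_tails = sum(len(s) for s in tails_per_head.values()) / len(tails_per_head)
    head_side = "1" if avg_heads < threshold else "MANY"
    tail_side = "1" if avg_tails < threshold else "MANY"
    return f"{head_side}-TO-{tail_side}"

# Made-up triplets: heads can have several tails, each tail has a single head.
toy = [("a", "part_of", "x"), ("a", "part_of", "y"), ("b", "part_of", "z")]
assert relationship_category(toy, "part_of") == "1-TO-MANY"
```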
If this average number was below 1.5 then the argument was labeled as 1, and as MANY otherwise. For example, a relationship having an average of 1.2 heads per tail and of 3.2 tails per head was classified as 1-TO-MANY. We obtained that FB15k has 26.2% of 1-TO-1 relationships, 22.7% of 1-TO-MANY, 28.3% of MANY-TO-1, and 22.8% of MANY-TO-MANY.

These detailed results in Table 4 allow for a precise evaluation and understanding of the behavior of the methods. First, it appears that, as one would expect, it is easier to predict entities on the "side 1" of triplets (i.e., predicting head in 1-TO-MANY and tail in MANY-TO-1), that is, when multiple entities point to it. These are the well-posed cases. SME(bilinear) proves to be very accurate in such cases because they are those with the most training examples. Unstructured performs well on 1-TO-1 relationships: this shows that arguments of such relationships must share common hidden types that Unstructured is able to somewhat uncover by clustering entities linked together in the embedding space. But this strategy fails for any other category of relationship. Adding the translation term (i.e. upgrading Unstructured into TransE) brings the ability to move in the embedding space, from one entity cluster to another, by following relationships. This is particularly spectacular for the well-posed cases.

Illustration. Table 5 gives examples of link prediction results of TransE on the FB15k test set (predicting tail). This illustrates the capabilities of our model. Given a head and a label, the top predicted tails (and the true one) are depicted. The examples come from the FB15k test set.

Figure 1: Learning new relationships with few examples. Comparative experiments on FB15k data evaluated in mean rank (left) and hits@10 (right). More details in the text.
Even if the correct answer is not always top-ranked, the predictions reflect common sense.

4.4 Learning to predict new relationships with few examples

Using FB15k, we tested how well methods generalize to new facts by checking how fast they learn new relationships. To that end, we randomly selected 40 relationships and split the data into two sets: a set (named FB15k-40rel) containing all triplets with these 40 relationships, and another set (FB15k-rest) containing the rest. We made sure that both sets contained all entities. FB15k-rest was then split into a training set of 353,788 triplets and a validation set of 53,266, and FB15k-40rel into a training set of 40,000 triplets (1,000 for each relationship) and a test set of 45,159. Using these data sets, we conducted the following experiment: (1) models were trained and selected using the FB15k-rest training and validation sets, (2) they were subsequently trained on the FB15k-40rel training set, but only to learn the parameters related to the 40 new relationships, (3) they were evaluated in link prediction on the FB15k-40rel test set (containing only relationships unseen during phase (1)). We repeated this procedure using 0, 10, 100 and 1,000 examples of each relationship in phase (2).

Results for Unstructured, SE, SME(linear), SME(bilinear) and TransE are presented in Figure 1. Unstructured performs best when no example of the unknown relationship is provided, because it does not use this information to predict; but, of course, its performance does not improve as labeled examples are provided. TransE is the fastest method to learn: with only 10 examples of a new relationship, hits@10 already reaches 18%, and it improves monotonically with the number of provided samples.
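This fast adaptation is easy to see from the model's form: TransE represents a relationship ℓ as a translation in embedding space, scoring a triplet by the distance between h + ℓ and t, so with entity embeddings frozen, a new relationship is just one vector that a few (head, tail) pairs already constrain. A minimal NumPy sketch of this idea; the closed-form mean estimate is our illustration, not the paper's gradient-based training procedure:

```python
import numpy as np

def estimate_new_relation(entity_emb, examples):
    """Estimate a translation vector l for a new relationship so that
    h + l ≈ t, keeping all entity embeddings frozen.

    entity_emb: dict mapping entity id -> 1-D numpy array (already trained).
    examples: list of (head, tail) pairs observed for the new relationship.
    The least-squares solution is the mean of t - h over the examples.
    """
    diffs = [entity_emb[t] - entity_emb[h] for h, t in examples]
    return np.mean(diffs, axis=0)

def score(entity_emb, l, h, t):
    """TransE dissimilarity: L2 distance between h + l and t (lower is better)."""
    return np.linalg.norm(entity_emb[h] + l - entity_emb[t])
```

With such an estimate, candidate tails can be ranked by `score` exactly as for relationships seen during the main training phase.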
We believe the simplicity of the TransE model makes it able to generalize well, without having to modify any of the already trained embeddings.

5 Conclusion and future work

We proposed a new approach to learn embeddings of KBs, focusing on a minimal parametrization of the model so as to primarily represent hierarchical relationships. We showed that it works very well compared to competing methods on two different knowledge bases, and that it is also highly scalable: we applied it to a very large-scale chunk of Freebase data. Although it remains unclear to us whether all relationship types can be modeled adequately by our approach, breaking down the evaluation into categories (1-TO-1, 1-TO-MANY, ...) shows that it performs well compared to other approaches across all settings.

Future work could analyze this model further, and also concentrate on exploiting it in more tasks, in particular applications such as learning word representations, inspired by [8]. Combining KBs with text as in [2] is another important direction where our approach could prove useful. Indeed, we recently and fruitfully inserted TransE into a framework for relation extraction from text [16].

Acknowledgments

This work was carried out in the framework of the Labex MS2T (ANR-11-IDEX-0004-02), and funded by the French National Agency for Research (EVEREST-12-JS02-005-01). We thank X. Glorot for providing the code infrastructure, and T. Strohmann and K. Murphy for useful discussions.

References

[1] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.

[2] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 2013.

[3] A. Bordes, J. Weston, R. Collobert, and Y. Bengio.
Learning structured embeddings of knowledge bases. In Proceedings of the 25th Annual Conference on Artificial Intelligence (AAAI), 2011.

[4] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[5] R. A. Harshman and M. E. Lundy. PARAFAC: parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39-72, Aug. 1994.

[6] R. Jenatton, N. Le Roux, A. Bordes, G. Obozinski, et al. A latent factor model for highly multi-relational data. In Advances in Neural Information Processing Systems (NIPS 25), 2012.

[7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st Annual Conference on Artificial Intelligence (AAAI), 2006.

[8] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS 26), 2013.

[9] G. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41, 1995.

[10] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In Advances in Neural Information Processing Systems (NIPS 22), 2009.

[11] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

[12] M. Nickel, V. Tresp, and H.-P. Kriegel. Factorizing YAGO: scalable machine learning for linked data. In Proceedings of the 21st International Conference on World Wide Web (WWW), 2012.

[13] A. P. Singh and G. J. Gordon.
Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2008.

[14] R. Socher, D. Chen, C. D. Manning, and A. Y. Ng. Learning new facts from knowledge bases with neural tensor networks and semantic word vectors. In Advances in Neural Information Processing Systems (NIPS 26), 2013.

[15] I. Sutskever, R. Salakhutdinov, and J. Tenenbaum. Modelling relational data using Bayesian clustered tensor factorization. In Advances in Neural Information Processing Systems (NIPS 22), 2009.

[16] J. Weston, A. Bordes, O. Yakhnenko, and N. Usunier. Connecting language and knowledge bases with embedding models for relation extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

[17] J. Zhu. Max-margin nonparametric latent feature models for link prediction. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.