{"title": "Distributed Representations of Words and Phrases and their Compositionality", "book": "Advances in Neural Information Processing Systems", "page_first": 3111, "page_last": 3119, "abstract": "The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several improvements that make the Skip-gram model more expressive and enable it to learn higher quality vectors more rapidly. We show that by subsampling frequent words we obtain significant speedup, and also learn higher quality representations as measured by our tasks. We also introduce Negative Sampling, a simplified variant of Noise Contrastive Estimation (NCE) that learns more accurate vectors for frequent words compared to the hierarchical softmax. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of \"Canada\" and \"Air\" cannot be easily combined to obtain \"Air Canada\". Motivated by this example, we present a simple and efficient method for finding phrases, and show that their vector representations can be accurately learned by the Skip-gram model. 
\"", "full_text": "Distributed Representations of Words and Phrases\n\nand their Compositionality\n\nTomas Mikolov\n\nGoogle Inc.\n\nMountain View\n\nIlya Sutskever\n\nGoogle Inc.\n\nMountain View\n\nKai Chen\nGoogle Inc.\n\nMountain View\n\nmikolov@google.com\n\nilyasu@google.com\n\nkai@google.com\n\nGreg Corrado\n\nGoogle Inc.\n\nMountain View\n\nJeffrey Dean\nGoogle Inc.\n\nMountain View\n\ngcorrado@google.com\n\njeff@google.com\n\nAbstract\n\nThe recently introduced continuous Skip-gram model is an ef\ufb01cient method for\nlearning high-quality distributed vector representations that capture a large num-\nber of precise syntactic and semantic word relationships. In this paper we present\nseveral extensions that improve both the quality of the vectors and the training\nspeed. By subsampling of the frequent words we obtain signi\ufb01cant speedup and\nalso learn more regular word representations. We also describe a simple alterna-\ntive to the hierarchical softmax called negative sampling.\nAn inherent limitation of word representations is their indifference to word order\nand their inability to represent idiomatic phrases. For example, the meanings of\n\u201cCanada\u201d and \u201cAir\u201d cannot be easily combined to obtain \u201cAir Canada\u201d. Motivated\nby this example, we present a simple method for \ufb01nding phrases in text, and show\nthat learning good vector representations for millions of phrases is possible.\n\n1 Introduction\n\nDistributed representations of words in a vector space help learning algorithms to achieve better\nperformance in natural language processing tasks by grouping similar words. One of the earliest use\nof word representations dates back to 1986 due to Rumelhart, Hinton, and Williams [13]. This idea\nhas since been applied to statistical language modeling with considerable success [1]. 
The follow-up\nwork includes applications to automatic speech recognition and machine translation [14, 7], and\na wide range of NLP tasks [2, 20, 15, 3, 18, 19, 9].\n\nRecently, Mikolov et al. [8] introduced the Skip-gram model, an efficient method for learning high-quality\nvector representations of words from large amounts of unstructured text data. Unlike most\nof the previously used neural network architectures for learning word vectors, training of the Skip-gram\nmodel (see Figure 1) does not involve dense matrix multiplications. This makes the training\nextremely efficient: an optimized single-machine implementation can train on more than 100 billion\nwords in one day.\n\nThe word representations computed using neural networks are very interesting because the learned\nvectors explicitly encode many linguistic regularities and patterns. Somewhat surprisingly, many of\nthese patterns can be represented as linear translations. For example, the result of a vector calculation\nvec(\u201cMadrid\u201d) - vec(\u201cSpain\u201d) + vec(\u201cFrance\u201d) is closer to vec(\u201cParis\u201d) than to any other word\nvector [9, 8].\n\nFigure 1: The Skip-gram model architecture. The training objective is to learn word vector representations\nthat are good at predicting the nearby words.\n\nIn this paper we present several extensions of the original Skip-gram model. 
We show that sub-\nsampling of frequent words during training results in a signi\ufb01cant speedup (around 2x - 10x), and\nimproves accuracy of the representations of less frequent words. In addition, we present a simpli-\n\ufb01ed variant of Noise Contrastive Estimation (NCE) [4] for training the Skip-gram model that results\nin faster training and better vector representations for frequent words, compared to more complex\nhierarchical softmax that was used in the prior work [8].\n\nWord representations are limited by their inability to represent idiomatic phrases that are not com-\npositions of the individual words. For example, \u201cBoston Globe\u201d is a newspaper, and so it is not a\nnatural combination of the meanings of \u201cBoston\u201d and \u201cGlobe\u201d. Therefore, using vectors to repre-\nsent the whole phrases makes the Skip-gram model considerably more expressive. Other techniques\nthat aim to represent meaning of sentences by composing the word vectors, such as the recursive\nautoencoders [15], would also bene\ufb01t from using phrase vectors instead of the word vectors.\n\nThe extension from word based to phrase based models is relatively simple. First we identify a large\nnumber of phrases using a data-driven approach, and then we treat the phrases as individual tokens\nduring the training. To evaluate the quality of the phrase vectors, we developed a test set of analogi-\ncal reasoning tasks that contains both words and phrases. A typical analogy pair from our test set is\n\u201cMontreal\u201d:\u201cMontreal Canadiens\u201d::\u201cToronto\u201d:\u201cToronto Maple Leafs\u201d. It is considered to have been\nanswered correctly if the nearest representation to vec(\u201cMontreal Canadiens\u201d) - vec(\u201cMontreal\u201d) +\nvec(\u201cToronto\u201d) is vec(\u201cToronto Maple Leafs\u201d).\n\nFinally, we describe another interesting property of the Skip-gram model. We found that simple\nvector addition can often produce meaningful results. 
For example, vec(\u201cRussia\u201d) + vec(\u201criver\u201d) is\nclose to vec(\u201cVolga River\u201d), and vec(\u201cGermany\u201d) + vec(\u201ccapital\u201d) is close to vec(\u201cBerlin\u201d). This\ncompositionality suggests that a non-obvious degree of language understanding can be obtained by\nusing basic mathematical operations on the word vector representations.\n\n2 The Skip-gram Model\n\nThe training objective of the Skip-gram model is to find word representations that are useful for\npredicting the surrounding words in a sentence or a document. More formally, given a sequence of\ntraining words $w_1, w_2, w_3, \\ldots, w_T$, the objective of the Skip-gram model is to maximize the average\nlog probability\n\n$$\\frac{1}{T} \\sum_{t=1}^{T} \\sum_{-c \\le j \\le c,\\, j \\ne 0} \\log p(w_{t+j} \\mid w_t) \\tag{1}$$\n\nwhere c is the size of the training context (which can be a function of the center word $w_t$). Larger\nc results in more training examples and thus can lead to a higher accuracy, at the expense of the\ntraining time. The basic Skip-gram formulation defines $p(w_{t+j} \\mid w_t)$ using the softmax function:\n\n$$p(w_O \\mid w_I) = \\frac{\\exp\\left({v'_{w_O}}^{\\top} v_{w_I}\\right)}{\\sum_{w=1}^{W} \\exp\\left({v'_{w}}^{\\top} v_{w_I}\\right)} \\tag{2}$$\n\nwhere $v_w$ and $v'_w$ are the \u201cinput\u201d and \u201coutput\u201d vector representations of w, and W is the number\nof words in the vocabulary. This formulation is impractical because the cost of computing\n$\\nabla \\log p(w_O \\mid w_I)$ is proportional to W, which is often large ($10^5$\u2013$10^7$ terms).\n\n2.1 Hierarchical Softmax\n\nA computationally efficient approximation of the full softmax is the hierarchical softmax. In the\ncontext of neural network language models, it was first introduced by Morin and Bengio [12]. 
The\nmain advantage is that instead of evaluating W output nodes in the neural network to obtain the\nprobability distribution, only about $\\log_2(W)$ nodes need to be evaluated.\nThe hierarchical softmax uses a binary tree representation of the output layer with the W words as\nits leaves and, for each node, explicitly represents the relative probabilities of its child nodes. These\ndefine a random walk that assigns probabilities to words.\n\nMore precisely, each word w can be reached by an appropriate path from the root of the tree. Let\nn(w, j) be the j-th node on the path from the root to w, and let L(w) be the length of this path, so\nn(w, 1) = root and n(w, L(w)) = w. In addition, for any inner node n, let ch(n) be an arbitrary\nfixed child of n and let [[x]] be 1 if x is true and -1 otherwise. Then the hierarchical softmax defines\n$p(w_O \\mid w_I)$ as follows:\n\n$$p(w \\mid w_I) = \\prod_{j=1}^{L(w)-1} \\sigma\\left([[n(w, j+1) = \\mathrm{ch}(n(w, j))]] \\cdot {v'_{n(w,j)}}^{\\top} v_{w_I}\\right) \\tag{3}$$\n\nwhere $\\sigma(x) = 1/(1 + \\exp(-x))$. It can be verified that $\\sum_{w=1}^{W} p(w \\mid w_I) = 1$. This implies that the\ncost of computing $\\log p(w_O \\mid w_I)$ and $\\nabla \\log p(w_O \\mid w_I)$ is proportional to $L(w_O)$, which on average\nis no greater than $\\log W$. Also, unlike the standard softmax formulation of the Skip-gram, which\nassigns two representations $v_w$ and $v'_w$ to each word w, the hierarchical softmax formulation has\none representation $v_w$ for each word w and one representation $v'_n$ for every inner node n of the\nbinary tree.\n\nThe structure of the tree used by the hierarchical softmax has a considerable effect on the performance.\nMnih and Hinton explored a number of methods for constructing the tree structure and the\neffect on both the training time and the resulting model accuracy [10]. In our work we use a binary\nHuffman tree, as it assigns short codes to the frequent words, which results in fast training. 
It has\nbeen observed before that grouping words together by their frequency works well as a very simple\nspeedup technique for the neural network based language models [5, 8].\n\n2.2 Negative Sampling\n\nAn alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE), which was introduced\nby Gutmann and Hyvarinen [4] and applied to language modeling by Mnih and Teh [11].\nNCE posits that a good model should be able to differentiate data from noise by means of logistic\nregression. This is similar to the hinge loss used by Collobert and Weston [2], who trained the models\nby ranking the data above noise.\n\n[Figure 2 plot: \u201cCountry and Capital Vectors Projected by PCA\u201d. Countries shown: China, Russia, Japan, Turkey, Poland, Germany, France, Italy, Spain, Greece, Portugal. Capitals shown: Beijing, Moscow, Tokyo, Ankara, Warsaw, Berlin, Paris, Rome, Madrid, Athens, Lisbon.]\n\nFigure 2: Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their\ncapital cities. The figure illustrates the ability of the model to automatically organize concepts and learn implicitly\nthe relationships between them, as during the training we did not provide any supervised information about\nwhat a capital city means.\n\nWhile NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram\nmodel is only concerned with learning high-quality vector representations, so we are free to\nsimplify NCE as long as the vector representations retain their quality. We define Negative sampling\n(NEG) by the objective\n\n$$\\log \\sigma\\left({v'_{w_O}}^{\\top} v_{w_I}\\right) + \\sum_{i=1}^{k} \\mathbb{E}_{w_i \\sim P_n(w)} \\left[\\log \\sigma\\left(-{v'_{w_i}}^{\\top} v_{w_I}\\right)\\right] \\tag{4}$$\n\nwhich is used to replace every $\\log P(w_O \\mid w_I)$ term in the Skip-gram objective. 
Thus the task is to\ndistinguish the target word $w_O$ from draws from the noise distribution $P_n(w)$ using logistic regression,\nwhere there are k negative samples for each data sample. Our experiments indicate that values\nof k in the range 5\u201320 are useful for small training datasets, while for large datasets k can be as\nsmall as 2\u20135. The main difference between Negative sampling and NCE is that NCE needs both\nsamples and the numerical probabilities of the noise distribution, while Negative sampling uses only\nsamples. And while NCE approximately maximizes the log probability of the softmax, this property\nis not important for our application.\nBoth NCE and NEG have the noise distribution $P_n(w)$ as a free parameter. We investigated a number\nof choices for $P_n(w)$ and found that the unigram distribution $U(w)$ raised to the 3/4 power (i.e.,\n$U(w)^{3/4}/Z$) significantly outperformed the unigram and the uniform distributions, for both NCE\nand NEG on every task we tried including language modeling (not reported here).\n\n2.3 Subsampling of Frequent Words\n\nIn very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g.,\n\u201cin\u201d, \u201cthe\u201d, and \u201ca\u201d). Such words usually provide less information value than the rare words. For\nexample, while the Skip-gram model benefits from observing the co-occurrences of \u201cFrance\u201d and\n\u201cParis\u201d, it benefits much less from observing the frequent co-occurrences of \u201cFrance\u201d and \u201cthe\u201d, as\nnearly every word co-occurs frequently within a sentence with \u201cthe\u201d. 
This idea can also be applied\nin the opposite direction; the vector representations of frequent words do not change significantly\nafter training on several million examples.\n\nTo counter the imbalance between the rare and frequent words, we used a simple subsampling approach:\neach word $w_i$ in the training set is discarded with probability computed by the formula\n\n$$P(w_i) = 1 - \\sqrt{\\frac{t}{f(w_i)}} \\tag{5}$$\n\nwhere $f(w_i)$ is the frequency of word $w_i$ and t is a chosen threshold, typically around $10^{-5}$.\nWe chose this subsampling formula because it aggressively subsamples words whose frequency\nis greater than t while preserving the ranking of the frequencies. Although this subsampling formula\nwas chosen heuristically, we found it to work well in practice. It accelerates learning and even\nsignificantly improves the accuracy of the learned vectors of the rare words, as will be shown in the\nfollowing sections.\n\nMethod | Time [min] | Syntactic [%] | Semantic [%] | Total accuracy [%]\nNEG-5 | 38 | 63 | 54 | 59\nNEG-15 | 97 | 63 | 58 | 61\nHS-Huffman | 41 | 53 | 40 | 47\nNCE-5 | 38 | 60 | 45 | 53\nThe following results use $10^{-5}$ subsampling\nNEG-5 | 14 | 61 | 58 | 60\nNEG-15 | 36 | 61 | 61 | 61\nHS-Huffman | 21 | 52 | 59 | 55\n\nTable 1: Accuracy of various Skip-gram 300-dimensional models on the analogical reasoning task\nas defined in [8]. NEG-k stands for Negative Sampling with k negative samples for each positive\nsample; NCE stands for Noise Contrastive Estimation and HS-Huffman stands for the Hierarchical\nSoftmax with the frequency-based Huffman codes.\n\n3 Empirical Results\n\nIn this section we evaluate the Hierarchical Softmax (HS), Noise Contrastive Estimation, Negative\nSampling, and subsampling of the training words. We used the analogical reasoning task (see footnote 1) introduced\nby Mikolov et al. [8]. 
The task consists of analogies such as \u201cGermany\u201d : \u201cBerlin\u201d :: \u201cFrance\u201d : ?,\nwhich are solved by \ufb01nding a vector x such that vec(x) is closest to vec(\u201cBerlin\u201d) - vec(\u201cGermany\u201d)\n+ vec(\u201cFrance\u201d) according to the cosine distance (we discard the input words from the search). This\nspeci\ufb01c example is considered to have been answered correctly if x is \u201cParis\u201d. The task has two\nbroad categories: the syntactic analogies (such as \u201cquick\u201d : \u201cquickly\u201d :: \u201cslow\u201d : \u201cslowly\u201d) and the\nsemantic analogies, such as the country to capital city relationship.\n\nFor training the Skip-gram models, we have used a large dataset consisting of various news articles\n(an internal Google dataset with one billion words). We discarded from the vocabulary all words\nthat occurred less than 5 times in the training data, which resulted in a vocabulary of size 692K.\nThe performance of various Skip-gram models on the word analogy test set is reported in Table 1.\nThe table shows that Negative Sampling outperforms the Hierarchical Softmax on the analogical\nreasoning task, and has even slightly better performance than the Noise Contrastive Estimation. The\nsubsampling of the frequent words improves the training speed several times and makes the word\nrepresentations signi\ufb01cantly more accurate.\n\nIt can be argued that the linearity of the skip-gram model makes its vectors more suitable for such\nlinear analogical reasoning, but the results of Mikolov et al. 
[8] also show that the vectors learned\nby the standard sigmoidal recurrent neural networks (which are highly non-linear) improve on this\ntask significantly as the amount of the training data increases, suggesting that non-linear models also\nhave a preference for a linear structure of the word representations.\n\n4 Learning Phrases\n\nAs discussed earlier, many phrases have a meaning that is not a simple composition of the meanings\nof their individual words. To learn vector representations for phrases, we first find words that\nappear frequently together, and infrequently in other contexts. For example, \u201cNew York Times\u201d and\n\u201cToronto Maple Leafs\u201d are replaced by unique tokens in the training data, while a bigram \u201cthis is\u201d\nwill remain unchanged.\n\nFootnote 1: code.google.com/p/word2vec/source/browse/trunk/questions-words.txt\n\nNewspapers\nNew York | New York Times | Baltimore | Baltimore Sun\nSan Jose | San Jose Mercury News | Cincinnati | Cincinnati Enquirer\nNHL Teams\nBoston | Boston Bruins | Montreal | Montreal Canadiens\nPhoenix | Phoenix Coyotes | Nashville | Nashville Predators\nNBA Teams\nDetroit | Detroit Pistons | Toronto | Toronto Raptors\nOakland | Golden State Warriors | Memphis | Memphis Grizzlies\nAirlines\nAustria | Austrian Airlines | Spain | Spainair\nBelgium | Brussels Airlines | Greece | Aegean Airlines\nCompany executives\nSteve Ballmer | Microsoft | Larry Page | Google\nSamuel J. Palmisano | IBM | Werner Vogels | Amazon\n\nTable 2: Examples of the analogical reasoning task for phrases (the full test set has 3218 examples).\nThe goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy\nof 72% on this dataset.\n\nThis way, we can form many reasonable phrases without greatly increasing the size of the vocabulary;\nin theory, we can train the Skip-gram model using all n-grams, but that would be too memory\nintensive. 
Many techniques have been previously developed to identify phrases in the text; however,\nit is out of scope of our work to compare them. We decided to use a simple data-driven approach,\nwhere phrases are formed based on the unigram and bigram counts, using\n\n$$\\mathrm{score}(w_i, w_j) = \\frac{\\mathrm{count}(w_i w_j) - \\delta}{\\mathrm{count}(w_i) \\times \\mathrm{count}(w_j)}. \\tag{6}$$\n\nThe $\\delta$ is used as a discounting coefficient and prevents too many phrases consisting of very infrequent\nwords from being formed. The bigrams with score above the chosen threshold are then used as\nphrases. Typically, we run 2-4 passes over the training data with decreasing threshold value, allowing\nlonger phrases that consist of several words to be formed. We evaluate the quality of the phrase\nrepresentations using a new analogical reasoning task that involves phrases. Table 2 shows examples\nof the five categories of analogies used in this task. This dataset is publicly available on the web (see footnote 2).\n\n4.1 Phrase Skip-Gram Results\n\nStarting with the same news data as in the previous experiments, we first constructed the phrase\nbased training corpus and then we trained several Skip-gram models using different hyper-parameters.\nAs before, we used vector dimensionality 300 and context size 5. This setting already\nachieves good performance on the phrase dataset, and allowed us to quickly compare the Negative\nSampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens.\nThe results are summarized in Table 3.\n\nThe results show that while Negative Sampling achieves a respectable accuracy even with k = 5,\nusing k = 15 achieves considerably better performance. Surprisingly, while we found the Hierarchical\nSoftmax to achieve lower performance when trained without subsampling, it became the best\nperforming method when we downsampled the frequent words. 
This shows that the subsampling\ncan result in faster training and can also improve accuracy, at least in some cases.\n\nFootnote 2: code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt\n\nMethod | Dimensionality | No subsampling [%] | $10^{-5}$ subsampling [%]\nNEG-5 | 300 | 24 | 27\nNEG-15 | 300 | 27 | 42\nHS-Huffman | 300 | 19 | 47\n\nTable 3: Accuracies of the Skip-gram models on the phrase analogy dataset. The models were\ntrained on approximately one billion words from the news dataset.\n\nPhrase | NEG-15 with $10^{-5}$ subsampling | HS with $10^{-5}$ subsampling\nVasco de Gama | Lingsugur | Italian explorer\nLake Baikal | Great Rift Valley | Aral Sea\nAlan Bean | Rebbeca Naomi | moonwalker\nIonian Sea | Ruegen | Ionian Islands\nchess master | chess grandmaster | Garry Kasparov\n\nTable 4: Examples of the closest entities to the given short phrases, using two different models.\n\nCzech + currency | Vietnam + capital | German + airlines | Russian + river | French + actress\nkoruna | Hanoi | airline Lufthansa | Moscow | Juliette Binoche\nCheck crown | Ho Chi Minh City | carrier Lufthansa | Volga River | Vanessa Paradis\nPolish zolty | Viet Nam | flag carrier Lufthansa | upriver | Charlotte Gainsbourg\nCTK | Vietnamese | Lufthansa | Russia | Cecile De\n\nTable 5: Vector compositionality using element-wise addition. Four closest tokens to the sum of two\nvectors are shown, using the best Skip-gram model.\n\nTo maximize the accuracy on the phrase analogy task, we increased the amount of the training data\nby using a dataset with about 33 billion words. We used the hierarchical softmax, dimensionality\nof 1000, and the entire sentence for the context. This resulted in a model that reached an accuracy\nof 72%. 
We achieved lower accuracy 66% when we reduced the size of the training dataset to 6B\nwords, which suggests that the large amount of the training data is crucial.\n\nTo gain further insight into how different the representations learned by different models are, we did\ninspect manually the nearest neighbours of infrequent phrases using various models. In Table 4, we\nshow a sample of such comparison. Consistently with the previous results, it seems that the best\nrepresentations of phrases are learned by a model with the hierarchical softmax and subsampling.\n\n5 Additive Compositionality\n\nWe demonstrated that the word and phrase representations learned by the Skip-gram model exhibit\na linear structure that makes it possible to perform precise analogical reasoning using simple vector\narithmetics. Interestingly, we found that the Skip-gram representations exhibit another kind of linear\nstructure that makes it possible to meaningfully combine words by an element-wise addition of their\nvector representations. This phenomenon is illustrated in Table 5.\n\nThe additive property of the vectors can be explained by inspecting the training objective. The word\nvectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors\nare trained to predict the surrounding words in the sentence, the vectors can be seen as representing\nthe distribution of the context in which a word appears. These values are related logarithmically\nto the probabilities computed by the output layer, so the sum of two word vectors is related to the\nproduct of the two context distributions. The product works here as the AND function: words that\nare assigned high probabilities by both word vectors will have high probability, and the other words\nwill have low probability. 
Thus, if \u201cVolga River\u201d appears frequently in the same sentence together\nwith the words \u201cRussian\u201d and \u201criver\u201d, the sum of these two word vectors will result in a feature\nvector that is close to the vector of \u201cVolga River\u201d.\n\n6 Comparison to Published Word Representations\n\nMany authors who previously worked on the neural network based representations of words have\npublished their resulting models for further use and comparison: amongst the most well known authors\nare Collobert and Weston [2], Turian et al. [17], and Mnih and Hinton [10]. We downloaded\ntheir word vectors from the web (see footnote 3). Mikolov et al. [8] have already evaluated these word representations\non the word analogy task, where the Skip-gram models achieved the best performance by a\nhuge margin.\n\nFootnote 3: http://metaoptimize.com/projects/wordreprs/\n\nModel (training time) | Redmond | Havel | ninjutsu | graffiti | capitulate\nCollobert (50d) (2 months) | conyers; lubbock; keene | plauen; dzerzhinsky; osterreich | reiki; kohona; karate | cheesecake; gossip; dioramas | abdicate; accede; rearm\nTurian (200d) (few weeks) | McCarthy; Alston; Cousins | Jewell; Arzu; Ovitz | -; -; - | gunfire; emotion; impunity | -; -; -\nMnih (100d) (7 days) | Podhurst; Harlang; Agarwal | Pontiff; Pinochet; Rodionov | -; -; - | anaesthetics; monkeys; Jews | Mavericks; planning; hesitated\nSkip-Phrase (1000d, 1 day) | Redmond Wash.; Redmond Washington; Microsoft | Vaclav Havel; president Vaclav Havel; Velvet Revolution | ninja; martial arts; swordsmanship | spray paint; grafitti; taggers | capitulation; capitulated; capitulating\n\nTable 6: Examples of the closest tokens given various well known models and the Skip-gram model\ntrained on phrases using over 30 billion training words. 
An empty cell means that the word was not\nin the vocabulary.\n\nTo give more insight into the difference of the quality of the learned vectors, we provide empirical\ncomparison by showing the nearest neighbours of infrequent words in Table 6. These examples show\nthat the big Skip-gram model trained on a large corpus visibly outperforms all the other models in\nthe quality of the learned representations. This can be attributed in part to the fact that this model\nhas been trained on about 30 billion words, which is about two to three orders of magnitude more\ndata than the typical size used in the prior work. Interestingly, although the training set is much\nlarger, the training time of the Skip-gram model is just a fraction of the time complexity required by\nthe previous model architectures.\n\n7 Conclusion\n\nThis work has several key contributions. We show how to train distributed representations of words\nand phrases with the Skip-gram model and demonstrate that these representations exhibit linear\nstructure that makes precise analogical reasoning possible. The techniques introduced in this paper\ncan be used also for training the continuous bag-of-words model introduced in [8].\n\nWe successfully trained models on several orders of magnitude more data than the previously pub-\nlished models, thanks to the computationally ef\ufb01cient model architecture. This results in a great\nimprovement in the quality of the learned word and phrase representations, especially for the rare\nentities. We also found that the subsampling of the frequent words results in both faster training\nand signi\ufb01cantly better representations of uncommon words. 
Another contribution of our paper is\nthe Negative sampling algorithm, which is an extremely simple training method that learns accurate\nrepresentations especially for frequent words.\n\nThe choice of the training algorithm and the hyper-parameter selection is a task speci\ufb01c decision,\nas we found that different problems have different optimal hyperparameter con\ufb01gurations. In our\nexperiments, the most crucial decisions that affect the performance are the choice of the model\narchitecture, the size of the vectors, the subsampling rate, and the size of the training window.\n\nA very interesting result of this work is that the word vectors can be somewhat meaningfully com-\nbined using just simple vector addition. Another approach for learning representations of phrases\npresented in this paper is to simply represent the phrases with a single token. Combination of these\ntwo approaches gives a powerful yet simple way how to represent longer pieces of text, while hav-\ning minimal computational complexity. Our work can thus be seen as complementary to the existing\napproach that attempts to represent phrases using recursive matrix-vector operations [16].\n\nWe made the code for training the word and phrase vectors based on the techniques described in this\npaper available as an open-source project4.\n\n4code.google.com/p/word2vec\n\n8\n\n\fReferences\n\n[1] Yoshua Bengio, R\u00b4ejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language\n\nmodel. The Journal of Machine Learning Research, 3:1137\u20131155, 2003.\n\n[2] Ronan Collobert and Jason Weston. A uni\ufb01ed architecture for natural language processing: deep neu-\nral networks with multitask learning. In Proceedings of the 25th international conference on Machine\nlearning, pages 160\u2013167. ACM, 2008.\n\n[3] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classi-\n\n\ufb01cation: A deep learning approach. 
In ICML, 513\u2013520, 2011.\n\n[4] Michael U Gutmann and Aapo Hyv\u00a8arinen. Noise-contrastive estimation of unnormalized statistical mod-\nels, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307\u2013361,\n2012.\n\n[5] Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. Extensions of\nrecurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011\nIEEE International Conference on, pages 5528\u20135531. IEEE, 2011.\n\n[6] Tomas Mikolov, Anoop Deoras, Daniel Povey, Lukas Burget and Jan Cernocky. Strategies for Training\nLarge Scale Neural Network Language Models. In Proc. Automatic Speech Recognition and Understand-\ning, 2011.\n\n[7] Tomas Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, PhD Thesis, Brno\n\nUniversity of Technology, 2012.\n\n[8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Ef\ufb01cient estimation of word representations\n\nin vector space. ICLR Workshop, 2013.\n\n[9] Tomas Mikolov, Wen-tau Yih and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word\n\nRepresentations. In Proceedings of NAACL HLT, 2013.\n\n[10] Andriy Mnih and Geoffrey E Hinton. A scalable hierarchical distributed language model. Advances in\n\nneural information processing systems, 21:1081\u20131088, 2009.\n\n[11] Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language\n\nmodels. arXiv preprint arXiv:1206.6426, 2012.\n\n[12] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Pro-\n\nceedings of the international workshop on arti\ufb01cial intelligence and statistics, pages 246\u2013252, 2005.\n\n[13] David E Rumelhart, Geoffrey E Hintont, and Ronald J Williams. Learning representations by back-\n\npropagating errors. Nature, 323(6088):533\u2013536, 1986.\n\n[14] Holger Schwenk. Continuous space language models. 
Computer Speech and Language, vol. 21, 2007.\n[15] Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and\nnatural language with recursive neural networks. In Proceedings of the 26th International Conference on\nMachine Learning (ICML), volume 2, 2011.\n\n[16] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality\nThrough Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Conference on Empirical Methods\nin Natural Language Processing (EMNLP), 2012.\n\n[17] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for\nsemi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computa-\ntional Linguistics, pages 384\u2013394. Association for Computational Linguistics, 2010.\n\n[18] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. In\n\nJournal of Arti\ufb01cial Intelligence Research, 37:141-188, 2010.\n\n[19] Peter D. Turney. Distributional semantics beyond words: Supervised learning of analogy and paraphrase.\n\nIn Transactions of the Association for Computational Linguistics (TACL), 353\u2013366, 2013.\n\n[20] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annota-\ntion. In Proceedings of the Twenty-Second international joint conference on Arti\ufb01cial Intelligence-Volume\nVolume Three, pages 2764\u20132770. AAAI Press, 2011.\n\n9\n\n\f", "award": [], "sourceid": 1421, "authors": [{"given_name": "Tomas", "family_name": "Mikolov", "institution": "Google Research"}, {"given_name": "Ilya", "family_name": "Sutskever", "institution": "Google Research"}, {"given_name": "Kai", "family_name": "Chen", "institution": "Google Research"}, {"given_name": "Greg", "family_name": "Corrado", "institution": "Google Research"}, {"given_name": "Jeff", "family_name": "Dean", "institution": "Google Research"}]}