Word2Vec

Word2Vec is a class of two-layer neural network models that, given an unlabelled training corpus, produce a vector for each word in the corpus that encodes its semantic information. In other words, Word2Vec takes a text corpus as input and outputs a feature vector for each word in that corpus; these numerical vectors can then be consumed by deep networks. Like other distributed representations, Word2Vec is grounded in the distributional hypothesis: the meaning of a word can be gauged from the contexts in which it appears. Hence, if two words occur in similar positions in two sentences, they are likely related semantically or syntactically.

We can measure the semantic similarity between two words by computing the cosine similarity between their corresponding word vectors. Because the embeddings place related words close together in vector space, similarity that is only implicit in raw text becomes a simple geometric computation.
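
As a minimal sketch of that computation (the toy vectors below are made up for illustration, not real embeddings), cosine similarity is just a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional vectors standing in for real word embeddings.
v_king  = np.array([0.8, 0.1, 0.4, 0.3])
v_queen = np.array([0.7, 0.2, 0.5, 0.3])
v_car   = np.array([0.1, 0.9, 0.0, 0.2])

print(cosine_similarity(v_king, v_queen))  # close to 1.0 -> similar
print(cosine_similarity(v_king, v_car))    # lower -> less similar
```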

Pennington, Socher, and Manning showed that the Word2Vec model is equivalent to weighting a word co-occurrence matrix by window distance and reducing its dimensionality through matrix factorization. Word2Vec itself comes in two architectures: continuous bag of words (CBOW) and Skip-gram. In CBOW, the goal is to predict a word given its surrounding words, while in Skip-gram, a window of surrounding words is predicted given a single word. The word vectors generated this way capture the contexts in which each word appears.
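
As a rough illustration of the two architectures, the sketch below trains both variants on a tiny toy corpus with Gensim. The parameter names (`vector_size`, `sg`) follow Gensim 4.x; older releases use `size` instead of `vector_size`.

```python
from gensim.models import Word2Vec

# A tiny toy corpus: in practice you would stream millions of sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 selects CBOW (predict a word from its context window),
# sg=1 selects Skip-gram (predict the context window from a word).
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram_model.wv["cat"].shape)              # (50,) -> one 50-d vector per word
print(skipgram_model.wv.similarity("cat", "dog"))  # cosine similarity between two words
```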

There are various Word2Vec resources available, including pre-trained vectors and libraries:

  • Stanford GloVe
    • A 5.7 GB GloVe data set trained on 840 billion tokens of Common Crawl data, with a vocabulary of 2.2 million words, each represented as a 300-dimensional vector.
  • Google Word2Vec
    • A 3.7 GB set of pre-trained vectors trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.
  • Gensim
    • Gensim is a Python library for unsupervised semantic modelling from plain text; it includes its own Word2Vec implementation and can load pre-trained vectors such as those above (see the sketch after this list).
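
For example, the Google News vectors can be loaded with Gensim and queried directly. This is a hedged sketch: it assumes the binary file distributed by Google has already been downloaded into the working directory.

```python
from gensim.models import KeyedVectors

# Assumes GoogleNews-vectors-negative300.bin has been downloaded locally;
# the path below is hypothetical and should point at your copy of the file.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

print(vectors["computer"].shape)                 # (300,) -> one 300-d vector per word
print(vectors.most_similar("computer", topn=5))  # nearest neighbours by cosine similarity
print(vectors.similarity("king", "queen"))       # cosine similarity between two words
```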



Published

09 November 2015
