Word Embedding Models

Word2Vec

The most commonly used word-embedding model is Word2Vec. A pre-trained model (such as the Google News vectors) can be downloaded from the Word2Vec project page. To load the model, call:

>>> import shorttext
>>> wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')

The Google News model is a binary file, and the default is binary=True.
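
If your pre-trained model is stored in text format instead, pass binary=False. A minimal sketch (the path is a placeholder):

>>> import shorttext
>>> wvmodel = shorttext.utils.load_word2vec_model('/path/to/vectors.txt', binary=False)   # text-format word2vec file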

shorttext.utils.wordembed.load_word2vec_model(path, binary=True)

Load a pre-trained Word2Vec model.

Parameters:
  • path (str) – path of the file of the pre-trained Word2Vec model
  • binary (bool) – whether the file is in binary format (Default: True)
Returns: a pre-trained Word2Vec model
Return type: gensim.models.keyedvectors.KeyedVectors

It is equivalent to calling,

>>> import gensim
>>> wvmodel = gensim.models.KeyedVectors.load_word2vec_format('/path/to/GoogleNews-vectors-negative300.bin.gz', binary=True)

Word2Vec is a neural network model that embeds words into vectors that carry semantic meaning. It is easy to extract the vector of a word, say ‘coffee’:

>>> wvmodel['coffee']   # an ndarray for the word will be output

One can find the most similar words to ‘coffee’ according to this model:

>>> wvmodel.most_similar('coffee')

which outputs:

[(u'coffees', 0.721267819404602),
 (u'gourmet_coffee', 0.7057087421417236),
 (u'Coffee', 0.6900454759597778),
 (u'o_joe', 0.6891065835952759),
 (u'Starbucks_coffee', 0.6874972581863403),
 (u'coffee_beans', 0.6749703884124756),
 (u'latt\xe9', 0.664122462272644),
 (u'cappuccino', 0.662549614906311),
 (u'brewed_coffee', 0.6621608138084412),
 (u'espresso', 0.6616827249526978)]

Or if you want to find the cosine similarity between ‘coffee’ and ‘tea’, enter:

>>> wvmodel.similarity('coffee', 'tea')   # outputs: 0.56352921707810621

Semantic relationships are reflected in vector differences. For example, we can vaguely say France - Paris = Taiwan - Taipei, or man - actor = woman - actress. First define the cosine similarity for readability:

>>> from scipy.spatial.distance import cosine
>>> similarity = lambda u, v: 1-cosine(u, v)

Then

>>> similarity(wvmodel['France'] + wvmodel['Taipei'] - wvmodel['Taiwan'], wvmodel['Paris'])  # outputs: 0.70574580801216202
>>> similarity(wvmodel['woman'] + wvmodel['actor'] - wvmodel['man'], wvmodel['actress'])  # outputs: 0.876354245612604
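
Gensim’s KeyedVectors can also answer such analogy queries directly through most_similar, using positive and negative word lists; for example:

>>> wvmodel.most_similar(positive=['France', 'Taipei'], negative=['Taiwan'], topn=3)   # 'Paris' is expected to rank near the top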

GloVe

The Stanford NLP Group developed GloVe, another word-embedding algorithm, with a well-developed theory explaining how it works. In practice, it is used in much the same way as Word2Vec.

One can convert a text-format GloVe model into a text-format Word2Vec model. More information can be found in the documentation of gensim: Converting GloVe to Word2Vec
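
As a sketch, assuming an older gensim release that still ships the glove2word2vec script (paths are placeholders; in gensim 4.x one can instead load the GloVe text file directly with KeyedVectors.load_word2vec_format(..., binary=False, no_header=True)):

>>> from gensim.scripts.glove2word2vec import glove2word2vec
>>> glove2word2vec('/path/to/glove.txt', '/path/to/glove.word2vec.txt')   # writes a word2vec-format text file
>>> wvmodel = shorttext.utils.load_word2vec_model('/path/to/glove.word2vec.txt', binary=False)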

FastText

FastText is a similar word-embedding model from Facebook. You can download pre-trained models here:

Pre-trained word vectors

To load a pre-trained FastText model, run:

>>> import shorttext
>>> ftmodel = shorttext.utils.load_fasttext_model('/path/to/model.bin')

It is used in exactly the same way as Word2Vec.
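
For example, you can query similar words just as with Word2Vec; since FastText uses subword information, it can also produce a vector for an out-of-vocabulary word (a sketch; the exact neighbours depend on the downloaded model):

>>> ftmodel.most_similar('coffee')
>>> ftmodel['coffe']   # a misspelled, out-of-vocabulary word still gets a vector from its character n-grams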

shorttext.utils.wordembed.load_fasttext_model(path, encoding='utf-8')

Load a pre-trained FastText model.

Parameters:
  • path (str) – path of the file of the pre-trained FastText model
Returns: a pre-trained FastText model
Return type: gensim.models.keyedvectors.FastTextKeyedVectors

Poincaré Embeddings

Poincaré embeddings are embeddings that capture both semantic similarity and hierarchical structure. To load a pre-trained model, run:

>>> import shorttext
>>> pemodel = shorttext.utils.load_poincare_model('/path/to/model.txt')

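Once loaded, the model is queried through gensim’s PoincareKeyedVectors interface; for instance (a sketch, with the words depending on the vocabulary of your model):

>>> pemodel.most_similar('mammal')            # nearest neighbours in the Poincaré ball
>>> pemodel.distance('mammal', 'carnivore')   # Poincaré distance between the two words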

shorttext.utils.wordembed.load_poincare_model(path, word2vec_format=True, binary=False)

Load a Poincare embedding model.

Parameters:
  • path (str) – path of the file of the pre-trained Poincare embedding model
  • word2vec_format (bool) – whether to load from word2vec format (default: True)
  • binary (bool) – binary format (default: False)
Returns: a pre-trained Poincare embedding model
Return type: gensim.models.poincare.PoincareKeyedVectors

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model. This package supports token and sentence embeddings computed with pre-trained language models, through the transformers package written by HuggingFace. In shorttext, run:

>>> from shorttext.utils import WrappedBERTEncoder
>>> encoder = WrappedBERTEncoder()   # the default model and tokenizer are loaded
>>> sentences_embedding, tokens_embedding, tokens = encoder.encode_sentences(['The car should turn right.', 'The answer is right.'])

The third line returns the embeddings of all sentences, the embeddings of all tokens in each sentence, and the tokens themselves (with CLS and SEP included). Unlike the previous embeddings, BERT’s token embeddings depend on context; in the above example, the embeddings of the two occurrences of “right” differ because the word carries different meanings.
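
A minimal sketch of verifying this, assuming tokens_embedding can be indexed by sentence and then by token position (the exact tensor layout may differ):

>>> import torch
>>> idx0 = tokens[0].index('right')   # position of 'right' in the first sentence
>>> idx1 = tokens[1].index('right')   # position of 'right' in the second sentence
>>> vec0 = tokens_embedding[0][idx0]
>>> vec1 = tokens_embedding[1][idx1]
>>> torch.cosine_similarity(vec0.unsqueeze(0), vec1.unsqueeze(0))   # noticeably below 1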

The default BERT model and tokenizer are bert-base-uncased. If you want to use others, refer to HuggingFace’s model list.
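
A sketch of switching to another pre-trained model, assuming WrappedBERTEncoder accepts pre-loaded HuggingFace model and tokenizer objects:

>>> from transformers import BertTokenizer, BertModel
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> model = BertModel.from_pretrained('bert-base-cased', output_hidden_states=True)   # hidden states assumed to be needed for layer-wise embeddings
>>> encoder = WrappedBERTEncoder(model=model, tokenizer=tokenizer)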

class shorttext.utils.transformers.BERTObject(model=None, tokenizer=None, trainable=False, device='cpu')

The base class for BERT models, containing the embedding model and the tokenizer.

For more information, please refer to the following paper:

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 (2018). [arXiv]

class shorttext.utils.transformers.WrappedBERTEncoder(model=None, tokenizer=None, max_length=48, nbencodinglayers=4, trainable=False, device='cpu')

This is the class that encodes sentences with BERT models.

For more information, please refer to the following paper:

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 (2018). [arXiv]

encode_sentences(sentences, numpy=False)

Encode the sentences, given as a list of strings, into numerical vectors.

It can output either torch tensors or numpy arrays.

Parameters:
  • sentences (list) – list of strings to encode
  • numpy (bool) – output a numpy array if True; otherwise, output a torch tensor. (Default: False)
Returns: encoded vectors for the sentences
Return type: numpy.array or torch.Tensor
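
For example, following the call shown above, to obtain numpy arrays instead of torch tensors:

>>> embeddings, token_embeddings, tokens = encoder.encode_sentences(['How are you?'], numpy=True)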

Other Functions

shorttext.utils.wordembed.shorttext_to_avgvec(shorttext, wvmodel)

Convert the short text into an averaged embedded vector representation.

Given a short sentence, it converts all the tokens into embedded vectors according to the given word-embedding model, sums them up, and normalizes the resulting vector. It returns the normalized vector that represents the short sentence.

Parameters:
  • shorttext (str) – a short sentence
  • wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding model
Returns: an embedded vector that represents the short sentence
Return type: numpy.ndarray
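
A usage example, with a rough sketch of the equivalent computation (the whitespace tokenization below is illustrative, not necessarily what the function uses internally):

>>> from shorttext.utils.wordembed import shorttext_to_avgvec
>>> vec = shorttext_to_avgvec('green tea', wvmodel)
>>> # roughly: sum the word vectors, then normalize
>>> import numpy as np
>>> tokens = [w for w in 'green tea'.split() if w in wvmodel]   # illustrative tokenization
>>> summed = np.sum([wvmodel[w] for w in tokens], axis=0)
>>> approx = summed / np.linalg.norm(summed)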

Reference

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 (2018). [arXiv]

Jayant Jain, “Implementing Poincaré Embeddings,” RaRe Technologies (2017). [RaRe]

Jeffrey Pennington, Richard Socher, Christopher D. Manning, “GloVe: Global Vectors for Word Representation,” Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543 (2014). [PDF]

Maximilian Nickel, Douwe Kiela, “Poincaré Embeddings for Learning Hierarchical Representations,” arXiv:1705.08039 (2017). [arXiv]

Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, “Enriching Word Vectors with Subword Information,” arXiv:1607.04606 (2016). [arXiv]

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” ICLR 2013 (2013). [arXiv]

Radim Řehůřek, “Making sense of word2vec,” RaRe Technologies (2014). [RaRe]

“Probabilistic Theory of Word Embeddings: GloVe,” Everything About Data Analytics, WordPress (2016). [WordPress]

“Toying with Word2Vec,” Everything About Data Analytics, WordPress (2015). [WordPress]

“Word-Embedding Algorithms,” Everything About Data Analytics, WordPress (2016). [WordPress]
