Word Embedding Models
Word2Vec
The most commonly used word-embedding model is Word2Vec. A pre-trained model (the Google News vectors) can be downloaded from the Word2Vec project page. To load the model, call:
>>> import shorttext
>>> wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')
The downloaded file is in binary format, and the default is binary=True.
shorttext.utils.wordembed.load_word2vec_model(path, binary=True)
    Load a pre-trained Word2Vec model.
    Parameters:
        - path (str) – path of the file of the pre-trained Word2Vec model
        - binary (bool) – whether the file is in binary format (Default: True)
    Returns: a pre-trained Word2Vec model
    Return type: gensim.models.keyedvectors.KeyedVectors
It is equivalent to calling:
>>> import gensim
>>> wvmodel = gensim.models.KeyedVectors.load_word2vec_format('/path/to/GoogleNews-vectors-negative300.bin.gz', binary=True)
Word2Vec is a neural network model that embeds words into vectors that carry semantic meaning. It is easy to extract the vector of a word, for example ‘coffee’:
>>> wvmodel['coffee'] # an ndarray for the word will be output
One can find the most similar words to ‘coffee’ according to this model:
>>> wvmodel.most_similar('coffee')
which outputs:
[(u'coffees', 0.721267819404602),
(u'gourmet_coffee', 0.7057087421417236),
(u'Coffee', 0.6900454759597778),
(u'o_joe', 0.6891065835952759),
(u'Starbucks_coffee', 0.6874972581863403),
(u'coffee_beans', 0.6749703884124756),
(u'latt\xe9', 0.664122462272644),
(u'cappuccino', 0.662549614906311),
(u'brewed_coffee', 0.6621608138084412),
(u'espresso', 0.6616827249526978)]
Or if you want to find the cosine similarity between ‘coffee’ and ‘tea’, enter:
>>> wvmodel.similarity('coffee', 'tea') # outputs: 0.56352921707810621
Semantic relations can be reflected in the differences between word vectors. For example, we can vaguely say France - Paris = Taiwan - Taipei, or man - actor = woman - actress. Define first the cosine similarity for readability:
>>> from scipy.spatial.distance import cosine
>>> similarity = lambda u, v: 1-cosine(u, v)
Then
>>> similarity(wvmodel['France'] + wvmodel['Taipei'] - wvmodel['Taiwan'], wvmodel['Paris']) # outputs: 0.70574580801216202
>>> similarity(wvmodel['woman'] + wvmodel['actor'] - wvmodel['man'], wvmodel['actress']) # outputs: 0.876354245612604
GloVe
The Stanford NLP Group developed GloVe, a similar word-embedding algorithm with a well-grounded theory explaining how it works. Its usage is very similar to that of Word2Vec.
One can convert a text-format GloVe model into a text-format Word2Vec model. More information can be found in the gensim documentation: Converting GloVe to Word2Vec.
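As an illustrative sketch (the file paths below are placeholders), the conversion can be done with gensim’s glove2word2vec script; in recent gensim versions the GloVe text file can also be loaded directly with KeyedVectors.load_word2vec_format(..., no_header=True). The converted file is then loaded like any text-format Word2Vec model:
>>> from gensim.scripts.glove2word2vec import glove2word2vec
>>> # convert the GloVe text file into word2vec text format
>>> glove2word2vec('/path/to/glove.6B.300d.txt', '/path/to/glove.6B.300d.w2v.txt')
>>> import gensim
>>> glovemodel = gensim.models.KeyedVectors.load_word2vec_format('/path/to/glove.6B.300d.w2v.txt', binary=False)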
FastText
FastText is a similar word-embedding model from Facebook. Pre-trained models can be downloaded from the fastText website.
To load a pre-trained FastText model, run:
>>> import shorttext
>>> ftmodel = shorttext.utils.load_fasttext_model('/path/to/model.bin')
It is used in exactly the same way as Word2Vec, as sketched below.
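A minimal sketch of such usage (the query words are arbitrary examples; whether unseen words can be looked up depends on whether the subword n-gram vectors are included in the loaded model):
>>> ftmodel['coffee']               # word vector, same interface as Word2Vec
>>> ftmodel.most_similar('coffee')  # most similar words
>>> # FastText builds vectors from character n-grams, so it can usually
>>> # produce a vector even for a misspelled or unseen word
>>> ftmodel['coffe']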
shorttext.utils.wordembed.load_fasttext_model(path, encoding='utf-8')
    Load a pre-trained FastText model.
    Parameters:
        - path (str) – path of the file of the pre-trained FastText model
    Returns: a pre-trained FastText model
    Return type: gensim.models.keyedvectors.FastTextKeyedVectors
Poincaré Embeddings
Poincaré embeddings are embeddings that learn both semantic similarity and hierarchical structure. To load a pre-trained model, run:
>>> import shorttext
>>> pemodel = shorttext.utils.load_poincare_model('/path/to/model.txt')
For preloaded word-embedding models, please refer to Word Embedding Models.
shorttext.utils.wordembed.load_poincare_model(path, word2vec_format=True, binary=False)
    Load a Poincaré embedding model.
    Parameters:
        - path (str) – path of the file of the pre-trained Poincaré embedding model
        - word2vec_format (bool) – whether to load from word2vec format (Default: True)
        - binary (bool) – whether the file is in binary format (Default: False)
    Returns: a pre-trained Poincaré embedding model
    Return type: gensim.models.poincare.PoincareKeyedVectors
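As a rough sketch, assuming the loaded model exposes gensim’s PoincareKeyedVectors interface and that the example words are in the vocabulary, one can query neighbours and distances in the Poincaré ball:
>>> pemodel.most_similar('coffee')          # nearest neighbours in the Poincaré ball
>>> pemodel.distance('coffee', 'beverage')  # Poincaré distance between two words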
BERT
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model. This package supports token and sentence embeddings using pre-trained language models, through the transformers package written by HuggingFace. In shorttext, run:
>>> from shorttext.utils import WrappedBERTEncoder
>>> encoder = WrappedBERTEncoder() # the default model and tokenizer are loaded
>>> sentences_embedding, tokens_embedding, tokens = encoder.encode_sentences(['The car should turn right.', 'The answer is right.'])
The third line returns the embeddings of all sentences, the embeddings of all tokens in each sentence, and the tokens themselves (with CLS and SEP included). Unlike the previous embeddings, token embeddings depend on the context; in the above example, the embeddings of the two occurrences of “right” are different because they have different meanings.
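To see this contextual effect numerically, one can compare the two token vectors for “right”. This is only a sketch; it assumes tokens is a list of token lists and that tokens_embedding can be indexed by sentence and token position (if the embeddings are torch tensors, .detach().numpy() may be needed instead of numpy.asarray):
>>> import numpy as np
>>> from scipy.spatial.distance import cosine
>>> idx0 = tokens[0].index('right')   # position of 'right' in the first sentence
>>> idx1 = tokens[1].index('right')   # position of 'right' in the second sentence
>>> v0 = np.asarray(tokens_embedding[0][idx0])
>>> v1 = np.asarray(tokens_embedding[1][idx1])
>>> 1 - cosine(v0, v1)   # noticeably below 1: the two 'right's get different vectors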
The default BERT model and tokenizer are bert-base-uncased. If you want to use others, refer to HuggingFace’s model list.
class shorttext.utils.transformers.BERTObject(model=None, tokenizer=None, trainable=False, device='cpu')
    The base class for BERT models; it contains the embedding model and the tokenizer.
    For more information, please refer to the following paper:
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 (2018). [arXiv]
class shorttext.utils.transformers.WrappedBERTEncoder(model=None, tokenizer=None, max_length=48, nbencodinglayers=4, trainable=False, device='cpu')
    This is the class that encodes sentences with BERT models.
    For more information, please refer to the following paper:
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 (2018). [arXiv]

    encode_sentences(sentences, numpy=False)
        Encode a list of sentences, given as strings, into numerical vectors. It can output either torch tensors or numpy arrays.
        Parameters:
            - sentences (list) – list of strings to encode
            - numpy (bool) – output numpy arrays if True; otherwise, output torch tensors (Default: False)
        Returns: encoded vectors for the sentences
        Return type: numpy.ndarray or torch.Tensor
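For instance, to obtain the outputs as numpy arrays rather than torch tensors (assuming the same three outputs as in the example above are returned):
>>> sentences_embedding, tokens_embedding, tokens = encoder.encode_sentences(
...     ['The car should turn right.', 'The answer is right.'],
...     numpy=True)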
Other Functions
shorttext.utils.wordembed.shorttext_to_avgvec(shorttext, wvmodel)
    Convert a short text into an averaged embedded vector representation.
    Given a short sentence, it converts all the tokens into embedded vectors according to the given word-embedding model, sums them up, and normalizes the resulting vector. It returns the resulting vector, which represents the short sentence.
    Parameters:
        - shorttext (str) – a short sentence
        - wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding model
    Returns: an embedded vector that represents the short sentence
    Return type: numpy.ndarray
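For example, reusing the Word2Vec model loaded earlier (the sentence is an arbitrary example):
>>> shorttext.utils.wordembed.shorttext_to_avgvec('i love coffee', wvmodel)  # a normalized vector representing the whole sentence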
Reference
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 (2018). [arXiv]
Jayant Jain, “Implementing Poincaré Embeddings,” RaRe Technologies (2017). [RaRe]
Jeffrey Pennington, Richard Socher, Christopher D. Manning, “GloVe: Global Vectors for Word Representation,” Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543 (2014). [PDF]
Maximilian Nickel, Douwe Kiela, “Poincaré Embeddings for Learning Hierarchical Representations,” arXiv:1705.08039 (2017). [arXiv]
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, “Enriching Word Vectors with Subword Information,” arXiv:1607.04606 (2016). [arXiv]
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” ICLR 2013 (2013). [arXiv]
Radim Řehůřek, “Making sense of word2vec,” RaRe Technologies (2014). [RaRe]
“Probabilistic Theory of Word Embeddings: GloVe,” Everything About Data Analytics, WordPress (2016). [WordPress]
“Toying with Word2Vec,” Everything About Data Analytics, WordPress (2015). [WordPress]
“Word-Embedding Algorithms,” Everything About Data Analytics, WordPress (2016). [WordPress]