Word Embedding Models
Word2Vec
The most commonly used word-embedding model is Word2Vec. A pre-trained model can be downloaded from the Word2Vec project page. To load the model, call:
>>> import shorttext
>>> wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')
The model file is in binary format, and the loader's default is binary=True.
Loading the model this way is equivalent to calling:
>>> import gensim
>>> wvmodel = gensim.models.KeyedVectors.load_word2vec_format('/path/to/GoogleNews-vectors-negative300.bin.gz', binary=True)
Word2Vec is a neural network model that embeds words into vectors that carry semantic meaning. It is easy to extract the vector of a word, for example the word ‘coffee’:
>>> wvmodel['coffee'] # an ndarray for the word will be output
One can find the most similar words to ‘coffee’ according to this model:
>>> wvmodel.most_similar('coffee')
which outputs:
[(u'coffees', 0.721267819404602),
(u'gourmet_coffee', 0.7057087421417236),
(u'Coffee', 0.6900454759597778),
(u'o_joe', 0.6891065835952759),
(u'Starbucks_coffee', 0.6874972581863403),
(u'coffee_beans', 0.6749703884124756),
(u'latt\xe9', 0.664122462272644),
(u'cappuccino', 0.662549614906311),
(u'brewed_coffee', 0.6621608138084412),
(u'espresso', 0.6616827249526978)]
Or if you want to find the cosine similarity between ‘coffee’ and ‘tea’, enter:
>>> wvmodel.similarity('coffee', 'tea') # outputs: 0.56352921707810621
Semantic meaning is also reflected in the differences between word vectors. For example, we can roughly say France - Paris = Taiwan - Taipei, or man - actor = woman - actress. First define the cosine similarity for readability:
>>> from scipy.spatial.distance import cosine
>>> similarity = lambda u, v: 1-cosine(u, v)
Then
>>> similarity(wvmodel['France'] + wvmodel['Taipei'] - wvmodel['Taiwan'], wvmodel['Paris']) # outputs: 0.70574580801216202
>>> similarity(wvmodel['woman'] + wvmodel['actor'] - wvmodel['man'], wvmodel['actress']) # outputs: 0.876354245612604
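The same arithmetic can be tried without downloading the full model. The following is a minimal sketch in plain Python; the three-dimensional vectors in `toy` and the helper `analogy` are invented for illustration and are not taken from any trained model or from shorttext.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: axis 0 marks countries, axis 1 marks capitals,
# and axis 2 separates the France/Paris pair from the Taiwan/Taipei pair.
toy = {
    'France': [1.0, 0.0, 1.0],
    'Paris': [0.0, 1.0, 1.0],
    'Taiwan': [1.0, 0.0, -1.0],
    'Taipei': [0.0, 1.0, -1.0],
    'coffee': [0.5, 0.5, 0.0],
}

def analogy(model, positive, negative):
    """Add the positive vectors, subtract the negative ones, rank the other words."""
    dim = len(next(iter(model.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, model[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, model[word])]
    # Words used in the query are excluded from the candidates.
    candidates = [w for w in model if w not in positive and w not in negative]
    scored = [(w, cosine_similarity(query, model[w])) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# France + Taipei - Taiwan should land closest to Paris.
result = analogy(toy, positive=['France', 'Taipei'], negative=['Taiwan'])
print(result)
```

With a real model, gensim's `most_similar` accepts `positive` and `negative` keyword arguments for this kind of query.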
GloVe
The Stanford NLP Group developed GloVe, a similar word-embedding algorithm with a good theory explaining how it works. In use, it is extremely similar to Word2Vec.
One can convert a text-format GloVe model into a text-format Word2Vec model. More information can be found in the documentation of gensim: Converting GloVe to Word2Vec
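The two text formats differ only in that the Word2Vec text format begins with a header line giving the number of vectors and their dimension. As a rough sketch of the conversion, assuming a whitespace-separated GloVe file with one token per line, the helper below prepends that header. The name `glove_to_word2vec_text` is hypothetical and is not part of shorttext or gensim, and the sample values are made up.

```python
import os
import tempfile

def glove_to_word2vec_text(glove_path, output_path):
    """Copy a GloVe text file to output_path, prefixed with a 'count dim' header."""
    with open(glove_path, encoding='utf-8') as f:
        # Reads the whole file into memory; acceptable for a sketch,
        # but a large model would call for a two-pass streaming approach.
        lines = [line for line in f if line.strip()]
    num_vectors = len(lines)
    # Each line has the form: token value_1 ... value_dim
    dim = len(lines[0].split()) - 1
    with open(output_path, 'w', encoding='utf-8') as out:
        out.write(f'{num_vectors} {dim}\n')
        out.writelines(lines)
    return num_vectors, dim

# Demonstration on a tiny hand-written file.
tmpdir = tempfile.mkdtemp()
glove_file = os.path.join(tmpdir, 'toy_glove.txt')
w2v_file = os.path.join(tmpdir, 'toy_word2vec.txt')
with open(glove_file, 'w', encoding='utf-8') as f:
    f.write('coffee 0.1 0.2 0.3\ntea 0.2 0.1 0.4\n')

shape = glove_to_word2vec_text(glove_file, w2v_file)
with open(w2v_file, encoding='utf-8') as f:
    header = f.readline().strip()
print(shape, header)
```

A file converted this way is in text format, so it would be loaded with `shorttext.utils.load_word2vec_model(path, binary=False)`.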
FastText
FastText is a similar word-embedding model from Facebook. Pre-trained models can be downloaded from the FastText website.
To load a pre-trained FastText model, run:
>>> import shorttext
>>> ftmodel = shorttext.utils.load_fasttext_model('/path/to/model.bin')
It is used in exactly the same way as Word2Vec.
Poincaré Embeddings
Poincaré embeddings are a more recent type of embedding that learns both semantic similarity and hierarchical structure. To load a pre-trained model, run:
>>> import shorttext
>>> pemodel = shorttext.utils.load_poincare_model('/path/to/model.txt')
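Poincaré embeddings place every vector inside the unit ball, and the distance between two points u and v is arcosh(1 + 2‖u - v‖² / ((1 - ‖u‖²)(1 - ‖v‖²))) (Nickel and Kiela, 2017). The sketch below computes this distance for hand-picked two-dimensional points; the points and the helper name `poincare_distance` are invented for illustration and do not come from a trained model. It shows how distances grow quickly near the boundary of the ball, which is what leaves room for tree-like hierarchies.

```python
import math

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# Hypothetical points: a "root" at the origin and two "leaves" near the boundary.
root = [0.0, 0.0]
leaf_a = [0.9, 0.0]
leaf_b = [0.0, 0.9]

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
print(d_root_leaf, d_leaf_leaf)
```

Although the two leaves are only about 1.27 apart in Euclidean terms, their Poincaré distance is close to the length of the path that goes through the root, much as in a tree.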
For preloaded word-embedding models, please refer to Word Embedding Models.
Other Functions
- shorttext.utils.wordembed.load_word2vec_model(path: str | PathLike, binary: bool = True) -> gensim.models.keyedvectors.KeyedVectors
Load a pre-trained Word2Vec model.
- Args:
path: Path to the Word2Vec model file.
binary: Whether the file is in binary format. Default: True.
- Returns:
A KeyedVectors model containing word embeddings.
- shorttext.utils.wordembed.load_fasttext_model(path: str | PathLike, encoding: Any = 'utf-8') -> gensim.models.fasttext.FastTextKeyedVectors
Load a pre-trained FastText model.
- Args:
path: Path to the FastText model file.
encoding: File encoding. Default: 'utf-8'.
- Returns:
A FastTextKeyedVectors model.
- shorttext.utils.wordembed.load_poincare_model(path: str | PathLike, word2vec_format: bool = True, binary: bool = False) -> gensim.models.poincare.PoincareKeyedVectors
Load a Poincaré embedding model.
- Args:
path: Path to the Poincaré model file.
word2vec_format: Whether to load from word2vec format. Default: True.
binary: Whether the file is binary. Default: False.
- Returns:
A PoincareKeyedVectors model.
- shorttext.utils.wordembed.shorttext_to_avgvec(shorttext: str, wvmodel: gensim.models.keyedvectors.KeyedVectors) -> Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array']
Convert short text to an averaged embedding vector.
Converts each token to its word embedding, averages them, and normalizes the result.
- Args:
shorttext: Input text.
wvmodel: Word embedding model.
- Returns:
A normalized vector representation of the text.
- class shorttext.utils.wordembed.RESTfulKeyedVectors(*args: Any, **kwargs: Any)
Bases: KeyedVectors
Remote word vector client via REST API.
Connects to a remote WordEmbedAPI service to access word embeddings via HTTP requests.
- Attributes:
url: Base URL of the API.
port: Port number for the API.
- __init__(url: str, port: str | int = '5000')
Initialize the client.
- Args:
url: Base URL of the API (e.g., 'http://localhost').
port: Port number. Default: '5000'.
- closer_than(entity1: str, entity2: str) -> list | dict
Find words closer to entity1 than entity2 is.
- Args:
entity1: First word.
entity2: Reference word.
- Returns:
List of words closer to entity1 than entity2.
- distance(entity1: str, entity2: str) -> float
Compute distance between two words.
- Args:
entity1: First word.
entity2: Second word.
- Returns:
Distance between the word vectors.
- distances(entity1: str, other_entities: list[str] | None = None) -> Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array']
Compute distances from one word to multiple words.
- Args:
entity1: First word.
other_entities: List of words to compare against.
- Returns:
Array of distances.
- get_vector(entity: str) -> Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array']
Get word vector for a word.
- Args:
entity: Word to get vector for.
- Returns:
Word embedding vector.
- Raises:
KeyError: If word not in vocabulary.
- most_similar(**kwargs) -> list[tuple[str, float]]
Find most similar words.
- Args:
**kwargs: Arguments passed to the API (e.g., positive, negative).
- Returns:
List of (word, similarity) tuples.
- most_similar_to_given(entity1: str, entities_list: list[str]) -> list[str]
Find most similar word from a list to a given word.
- Args:
entity1: Reference word.
entities_list: List of candidate words.
- Returns:
List of words sorted by similarity.
- rank(entity1: str, entity2: str) -> int
Get similarity rank between two words.
- Args:
entity1: First word.
entity2: Second word.
- Returns:
Rank of entity2 relative to entity1.
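To make the description of `shorttext_to_avgvec` above concrete, here is a minimal sketch of averaging followed by normalization. It is not the library's implementation: the whitespace tokenizer, the choice to skip out-of-vocabulary tokens, the helper name `average_vector`, and the plain-dictionary model `toy_model` are all assumptions made for illustration.

```python
import math

def average_vector(text, model, dim):
    """Average the vectors of known tokens, then scale the result to unit length."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    total = [0.0] * dim
    count = 0
    for token in tokens:
        if token in model:  # assumption: unknown tokens are skipped
            total = [t + x for t, x in zip(total, model[token])]
            count += 1
    if count == 0:
        return total  # no known tokens: return the zero vector
    mean = [t / count for t in total]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean] if norm > 0 else mean

# Hypothetical two-dimensional embeddings.
toy_model = {'hot': [1.0, 0.0], 'coffee': [0.0, 1.0]}

vec = average_vector('Hot coffee please', toy_model, dim=2)
print(vec)
```

Because the result is normalized, the cosine similarity between two such text vectors reduces to their dot product.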
Reference
Jayant Jain, “Implementing Poincaré Embeddings,” RaRe Technologies (2017). [RaRe]
Jeffrey Pennington, Richard Socher, Christopher D. Manning, “GloVe: Global Vectors for Word Representation,” Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543 (2014). [PDF]
Maximilian Nickel, Douwe Kiela, “Poincaré Embeddings for Learning Hierarchical Representations,” arXiv:1705.08039 (2017). [arXiv]
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, “Enriching Word Vectors with Subword Information,” arXiv:1607.04606 (2016). [arXiv]
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” ICLR 2013 (2013). [arXiv]
Radim Řehůřek, “Making sense of word2vec,” RaRe Technologies (2014). [RaRe]
“Probabilistic Theory of Word Embeddings: GloVe,” Everything About Data Analytics, WordPress (2016). [WordPress]
“Toying with Word2Vec,” Everything About Data Analytics, WordPress (2015). [WordPress]
“Word-Embedding Algorithms,” Everything About Data Analytics, WordPress (2016). [WordPress]