Metrics

The package shorttext provides a few metrics that measure the distances of some kind. They are all under :module:`shorttext.metrics`. The soft Jaccard score is based on spellings, and the Word Mover’s distance (WMD) embedded word vectors.

Edit Distance and Soft Jaccard Score

Edit distance, or Damerau-Levenshtein distance, measures the differences between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following: between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following: between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following: between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following: between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following: between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following: between two words due to insertion, deletion, transposition, substitution etc. Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext.metrics.dynprog.dldist import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # subsitution, gives 1

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following:

\(s = 1 - \frac{\text{DL distance}}{\max( \text(len(word1)), \text(len(word2)) )}\) and \(s = \frac{\text{longest common prefix}}{\max( \text(len(word1)), \text(len(word2)) )}\)

>>> similarity('topology', 'topological')    # gives 0.6363636363636364
>>> similarity('book', 'blok')               # gives 0.75

Given the similarity, we say that the intersection, for example, between ‘book’ and ‘blok’, has 0.75 elements, or the union has 1.25 elements. Then the similarity between two sets of tokens can be measured using Jaccard index, with this “soft” numbers of intersection. Therefore,

>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858

The functions damerau_levenshtein and longest_common_prefix are implemented using Cython . (Before release 0.7.2, they were interfaced to Python using SWIG (Simplified Wrapper and Interface Generator)).

shorttext.metrics.dynprog.jaccard.similarity(word1: str, word2: str) float[source]

Calculate similarity between two words.

Computes similarity as the maximum of: - 1 - Damerau-Levenshtein distance / max length - Longest common prefix length / max length

Args:

word1: First word. word2: Second word.

Returns:

Similarity score between 0 and 1.

Reference:

Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” IEEE CBMS 2014, pp. 347-350. http://ieeexplore.ieee.org/abstract/document/6881904/

shorttext.metrics.dynprog.jaccard.soft_intersection_list(tokens1: list[str], tokens2: list[str]) set[str][source]

Compute soft intersection between two token lists.

Finds the best matching pairs between tokens using similarity, where each token can only match once.

Args:

tokens1: First list of tokens. tokens2: Second list of tokens.

Returns:

Set of ((token1, token2), similarity) tuples representing matches.

shorttext.metrics.dynprog.jaccard.soft_jaccard_score(tokens1: str, tokens2: str) float[source]

Compute soft Jaccard score between token lists.

Uses fuzzy matching based on edit distance and longest common prefix.

Args:

tokens1: First list of tokens. tokens2: Second list of tokens.

Returns:

Soft Jaccard score between 0 and 1.

Reference:

Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” IEEE CBMS 2014, pp. 347-350. http://ieeexplore.ieee.org/abstract/document/6881904/

Word Mover’s Distance

Unlike soft Jaccard score that bases similarity on the words’ spellings, Word Mover’s distance (WMD) the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this the embedded word vectors. WMD is a special case for Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words are the Euclidean distance by default (not cosine distance), but user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033

More examples can be found in this IPython Notebook .

In gensim, the Word2Vec model allows the calculation of WMD if user installed the package PyEMD. It is based on the scale invariant feature transform (SIFT), an algorithm for EMD based on L1-distance (Manhattan distance). For more details, please refer to their tutorial , and cite the two papers by Ofir Pele and Michael Werman if it is used.

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance_linprog(first_sent_tokens: list[str], second_sent_tokens: list[str], wvmodel: gensim.models.keyedvectors.KeyedVectors, distancefunc: callable | None = None) OptimizeResult[source]

Compute Word Mover’s distance via linear programming.

Uses scipy.optimize.linprog to compute the transport problem for the Word Mover’s Distance.

Args:

first_sent_tokens: First list of tokens. second_sent_tokens: Second list of tokens. wvmodel: Word embedding model. distancefunc: Distance function for word vectors. Default: Euclidean.

Returns:

scipy.optimize.OptimizeResult containing the optimization result.

Reference:

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML 2015.

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance(first_sent_tokens: list[str], second_sent_tokens: list[str], wvmodel: gensim.models.keyedvectors.KeyedVectors, distancefunc: callable | None = None) float[source]

Compute Word Mover’s distance between token lists.

Uses word embeddings to compute the minimum transport cost between words in two sentences.

Args:

first_sent_tokens: First list of tokens. second_sent_tokens: Second list of tokens. wvmodel: Word embedding model. distancefunc: Distance function for word vectors. Default: Euclidean.

Returns:

The Word Mover’s distance (lower is more similar).

Reference:

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML 2015.

Jaccard Index Due to Cosine Distances

In the above section of edit distance, the Jaccard score was calculated by considering soft membership using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is using spelling. However, we can also compute the soft membership by cosine similarity with

>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the number of words between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to Google model), and therefore the Jaccard score is

\(0.78060223420956831 / (2-0.78060223420956831) = 0.6401538990056869\)

And it can be seen by running it:

>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573
shorttext.metrics.embedfuzzy.jaccard.jaccardscore_sents(sent1: str, sent2: str, wvmodel: gensim.models.keyedvectors.KeyedVectors, sim_words: callable | None = None) float[source]

Compute Jaccard score between sentences using embeddings.

Uses word embeddings to compute a fuzzy Jaccard score where word similarity is measured via embedding cosine similarity.

Args:

sent1: First sentence. sent2: Second sentence. wvmodel: Word embedding model. sim_words: Similarity function for word vectors. Default: cosine.

Returns:

Fuzzy Jaccard score between 0 and 1.

Reference

“Damerau-Levenshtein Distance.” [Wikipedia]

“Jaccard index.” [Wikipedia]

Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).

Ofir Pele, Michael Werman, “A linear time histogram metric for improved SIFT matching,” Computer Vision - ECCV 2008, 495-508 (2008). [ACM]

Ofir Pele, Michael Werman, “Fast and robust earth mover’s distances,” Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, 460-467 (2009). [IEEE]

“Word Mover’s Distance as a Linear Programming Problem,” Everything About Data Analytics, WordPress (2017). [WordPress]

Home: Homepage of shorttext