Metrics¶
The package shorttext provides a few metrics that measure distances of some kind. They are all under :mod:`shorttext.metrics`. The soft Jaccard score is based on spellings, while the Word Mover’s distance (WMD) is based on embedded word vectors.
Edit Distance and Soft Jaccard Score¶
Edit distance, or Damerau-Levenshtein distance, measures the difference between two words due to insertion, deletion, transposition, and substitution. Each of these changes contributes a distance of 1. The algorithm was written in C.
First import the package:
>>> from shorttext.metrics.dynprog import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
The distance can be calculated by:
>>> damerau_levenshtein('diver', 'driver') # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver') # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy') # transposition, gives 1
>>> damerau_levenshtein('book', 'blok') # substitution, gives 1
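For illustration, the dynamic-programming recurrence behind the Damerau-Levenshtein distance can be sketched in pure Python (the package ships a compiled implementation; this is only a stand-in covering insertion, deletion, substitution, and adjacent transposition):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    # Optimal string alignment distance: each insertion, deletion,
    # substitution, or adjacent transposition costs 1.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein('topology', 'tooplogy'))  # 1 (one transposition)
print(damerau_levenshtein('book', 'blok'))          # 1 (one substitution)
```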
The function longest_common_prefix finds the length of the common prefix of two words:
>>> longest_common_prefix('topology', 'topological') # gives 7
>>> longest_common_prefix('police', 'policewoman') # gives 6
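A minimal pure-Python equivalent of the common-prefix length (again, a sketch; the shipped implementation is compiled):

```python
def longest_common_prefix(word1: str, word2: str) -> int:
    # Count leading characters shared by both words.
    n = 0
    for c1, c2 in zip(word1, word2):
        if c1 != c2:
            break
        n += 1
    return n

print(longest_common_prefix('topology', 'topological'))  # 7
print(longest_common_prefix('police', 'policewoman'))    # 6
```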
The similarity between two words is defined as the larger of the following two values:
\(s = 1 - \frac{\text{DL distance}}{\max( \text{len}(word_1), \text{len}(word_2) )}\) and \(s = \frac{\text{longest common prefix}}{\max( \text{len}(word_1), \text{len}(word_2) )}\)
>>> similarity('topology', 'topological') # gives 0.6363636363636364
>>> similarity('book', 'blok') # gives 0.75
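The numbers above can be checked directly from the definition. The inputs below (Damerau-Levenshtein distances 4 and 1, common-prefix lengths 7 and 1) are hand-computed for the two word pairs in the examples:

```python
def similarity(dl_distance, lcp_length, len1, len2):
    # The similarity is the larger of 1 - DL/maxlen and LCP/maxlen.
    maxlen = max(len1, len2)
    return max(1.0 - dl_distance / maxlen, lcp_length / maxlen)

# 'topology' (8 letters) vs 'topological' (11): DL distance 4, common prefix 7
print(similarity(4, 7, 8, 11))   # 7/11 = 0.6363...
# 'book' vs 'blok': DL distance 1, common prefix 1
print(similarity(1, 1, 4, 4))    # 0.75
```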
Given the similarity, we say that the intersection between ‘book’ and ‘blok’, for example, contains 0.75 elements, and their union contains 1.25 elements. The similarity between two sets of tokens can then be measured by the Jaccard index, using these “soft” counts for the intersection and union. Therefore,
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers']) # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman']) # gives 0.2857142857142858
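The arithmetic behind these scores can be sketched as follows. The pairwise similarities used below (0.75 for ‘book’/‘blok’, 6/7 for ‘seller’/‘sellers’, 2/3 for ‘police’/‘policeman’) come from the similarity function above, with tokens matched one-to-one; this is a simplified illustration of the computation, not the shipped implementation:

```python
def soft_jaccard(matched_similarities, n1, n2):
    # Soft intersection: sum of similarities over one-to-one matched token pairs.
    intersection = sum(matched_similarities)
    union = n1 + n2 - intersection  # soft inclusion-exclusion
    return intersection / union

# similarity('book', 'blok') = 0.75, similarity('seller', 'sellers') = 6/7
print(soft_jaccard([0.75, 6.0 / 7.0], 2, 2))  # 45/67 = 0.67164...
# ['police', 'station'] vs ['policeman']: only 'police' finds a match (2/3);
# 'station' is left unmatched and contributes nothing.
print(soft_jaccard([2.0 / 3.0], 2, 1))        # 2/7 = 0.28571...
```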
The functions damerau_levenshtein and longest_common_prefix are implemented using Cython. (Before release 0.7.2, they were interfaced to Python using SWIG, the Simplified Wrapper and Interface Generator.)

shorttext.metrics.dynprog.jaccard.similarity(word1, word2)¶
Return the similarity between the two words.
Return the similarity between the two words, between 0 and 1 inclusively. The similarity is the maximum of the two values: 1 - (Damerau-Levenshtein distance between the two words / maximum length of the two words), and (longest common prefix of the two words / maximum length of the two words).
Reference: Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]
Parameters:  word1 (str) – a word
 word2 (str) – a word
Returns: similarity, between 0 and 1 inclusively
Return type: float

shorttext.metrics.dynprog.jaccard.soft_jaccard_score(tokens1, tokens2)¶
Return the soft Jaccard score of the two lists of tokens, between 0 and 1 inclusively.
Reference: Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]
Parameters:  tokens1 (list) – list of tokens.
 tokens2 (list) – list of tokens.
Returns: soft Jaccard score, between 0 and 1 inclusively.
Return type: float
Word Mover’s Distance¶
Unlike the soft Jaccard score, which bases similarity on the words’ spellings, the Word Mover’s distance (WMD) is based on embedded word vectors. WMD is a special case of the Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming, and the distance between words is the Euclidean distance by default (not the cosine distance), but the user can set it accordingly.
Import the modules, and load the word-embedding model:
>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
Examples:
>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel) # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel) # gives 2.276337146759033
More examples can be found in this IPython Notebook.
In gensim, the Word2Vec model allows the calculation of WMD if the user has installed the package PyEMD. It is based on the scale invariant feature transform (SIFT), an algorithm for EMD based on the L1-distance (Manhattan distance). For more details, please refer to their tutorial, and cite the two papers by Ofir Pele and Michael Werman if it is used.

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance(first_sent_tokens, second_sent_tokens, wvmodel, distancefunc=<function euclidean>, lpFile=None)¶
Compute the Word Mover’s distance (WMD) between the two given lists of tokens.
Using methods of linear programming, calculate the WMD between two lists of words. A word-embedding model has to be provided. The WMD is returned.
Reference: Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).
Parameters:  first_sent_tokens (list) – first list of tokens.
 second_sent_tokens (list) – second list of tokens.
 wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding model.
 distancefunc (function) – distance function that takes two numpy ndarray.
 lpFile (str) – deprecated, kept for backward compatibility. (default: None)
Returns: Word Mover’s distance (WMD)
Return type: float
Jaccard Index Due to Cosine Distances¶
In the section on edit distance above, the Jaccard score was calculated by considering soft membership based on spelling. However, we can also compute the soft membership by cosine similarity, with
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
For example, the soft number of words in the intersection between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to the Google model), and therefore the Jaccard score is
\(0.78060223420956831 / (2 - 0.78060223420956831) = 0.6401538990056869\)
This can be verified by running:
>>> jaccardscore_sents('doctor', 'physician', wvmodel) # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel) # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel) # gives 0.67588977344632573
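For a single pair of tokens, the computation reduces to the arithmetic quoted above: the soft intersection is the cosine similarity \(s\), the soft union is \(1 + 1 - s\), and the score is their ratio. A sketch, reusing the ‘doctor’/‘physician’ number from the text:

```python
def soft_jaccard_single(cosine_sim):
    # One token in each set: soft intersection = s, soft union = 2 - s.
    return cosine_sim / (2.0 - cosine_sim)

# Cosine similarity of 'doctor' and 'physician' in the Google model (from the text).
s = 0.78060223420956831
print(soft_jaccard_single(s))  # 0.6401538990056869
```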

shorttext.metrics.embedfuzzy.jaccard.jaccardscore_sents(sent1, sent2, wvmodel, sim_words=<function <lambda>>)¶
Compute the Jaccard score between sentences based on their word similarities.
Parameters:  sent1 (str) – first sentence
 sent2 (str) – second sentence
 wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding model
 sim_words (function) – function for calculating the similarities between a pair of word vectors (default: cosine)
Returns: soft Jaccard score
Return type: float
BERTScore¶
BERTScore comprises a category of metrics based on the BERT model. These metrics measure the similarity between sentences. To use it,
>>> from shorttext.metrics.transformers import BERTScorer
>>> scorer = BERTScorer() # using default BERT model and tokenizer
>>> scorer.recall_bertscore('The weather is cold.', 'It is freezing.') # 0.7223385572433472
>>> scorer.precision_bertscore('The weather is cold.', 'It is freezing.') # 0.7700849175453186
>>> scorer.f1score_bertscore('The weather is cold.', 'It is freezing.') # 0.7454479746418043
For BERT models, please refer to Word Embedding Models for more details.
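The final matching step of BERTScore can be sketched with plain Python. The similarity matrix below is made up for illustration; in BERTScorer the matrix comes from BERT token embeddings (via compute_matrix). Recall takes the best match for each reference token, precision the best match for each test token, and F1 is their harmonic mean:

```python
def bertscore_from_matrix(sim):
    # sim[i][j]: similarity between reference token i and test token j.
    recall = sum(max(row) for row in sim) / len(sim)  # best match per reference token
    ncols = len(sim[0])
    precision = sum(max(sim[i][j] for i in range(len(sim)))
                    for j in range(ncols)) / ncols    # best match per test token
    f1 = 2.0 * precision * recall / (precision + recall)  # harmonic mean
    return recall, precision, f1

# Made-up similarities: 2 reference tokens (rows), 3 test tokens (columns).
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.4]]
print(bertscore_from_matrix(sim))  # recall 0.85, precision 0.7, f1 = 0.76774...
```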

class shorttext.metrics.transformers.bertscore.BERTScorer(model=None, tokenizer=None, max_length=48, nbencodinglayers=4, device='cpu')¶
This class computes the BERTScores between sentences. BERTScores include recall BERTScores, precision BERTScores, and F1 BERTScores. For more information, please refer to this paper:
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv:1904.09675 (2019). [arXiv]

compute_matrix(sentence_a, sentence_b)¶
Compute the table of similarities between all pairs of tokens. This is used for calculating the BERTScores.
Parameters:  sentence_a (str) – first sentence
 sentence_b (str) – second sentence
Returns: similarity matrix of between tokens in two sentences
Return type: numpy.ndarray

f1score_bertscore(reference_sentence, test_sentence)¶
Compute the F1 BERTScore between two sentences.
Parameters:  reference_sentence (str) – reference sentence
 test_sentence (str) – test sentence
Returns: F1 BERTScore between the two sentences
Return type: float

precision_bertscore(reference_sentence, test_sentence)¶
Compute the precision BERTScore between two sentences.
Parameters:  reference_sentence (str) – reference sentence
 test_sentence (str) – test sentence
Returns: precision BERTScore between the two sentences
Return type: float

recall_bertscore(reference_sentence, test_sentence)¶
Compute the recall BERTScore between two sentences.
Parameters:  reference_sentence (str) – reference sentence
 test_sentence (str) – test sentence
Returns: recall BERTScore between the two sentences
Return type: float

Reference¶
“Damerau-Levenshtein Distance.” [Wikipedia]
“Jaccard index.” [Wikipedia]
Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]
Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).
Ofir Pele, Michael Werman, “A linear time histogram metric for improved SIFT matching,” Computer Vision - ECCV 2008, 495-508 (2008). [ACM]
Ofir Pele, Michael Werman, “Fast and robust earth mover’s distances,” Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, 460-467 (2009). [IEEE]
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv:1904.09675 (2019). [arXiv]
“Word Mover’s Distance as a Linear Programming Problem,” Everything About Data Analytics, WordPress (2017). [WordPress]