Metrics

The package shorttext provides a few metrics that measure distances of various kinds. They are all under :module:`shorttext.metrics`. The soft Jaccard score is based on spellings, while the Word Mover’s distance (WMD) is based on embedded word vectors.

Edit Distance and Soft Jaccard Score

Edit distance, or the Damerau-Levenshtein distance, measures the difference between two words caused by insertions, deletions, transpositions, and substitutions. Each of these changes contributes a distance of 1. The algorithm is implemented in Cython.

First import the package:

>>> from shorttext.metrics.dynprog import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score

The distance can be calculated by:

>>> damerau_levenshtein('diver', 'driver')        # insertion, gives 1
>>> damerau_levenshtein('driver', 'diver')        # deletion, gives 1
>>> damerau_levenshtein('topology', 'tooplogy')   # transposition, gives 1
>>> damerau_levenshtein('book', 'blok')           # substitution, gives 1
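
The dynamic-programming recursion behind the distance can be sketched in pure Python (an illustrative reimplementation, not the package’s Cython code; the function name dl_distance is ours):

```python
def dl_distance(word1, word2):
    """Restricted Damerau-Levenshtein distance by dynamic programming."""
    n1, n2 = len(word1), len(word2)
    # d[i][j] = distance between the first i chars of word1 and the first j of word2
    d = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        d[i][0] = i          # delete all i characters
    for j in range(n2 + 1):
        d[0][j] = j          # insert all j characters
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if word1[i-1] == word2[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution (or match)
            # transposition of two adjacent characters
            if i > 1 and j > 1 and word1[i-1] == word2[j-2] and word1[i-2] == word2[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[n1][n2]

dl_distance('topology', 'tooplogy')    # transposition, gives 1
```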

The longest common prefix finds the length of common prefix:

>>> longest_common_prefix('topology', 'topological')    # gives 7
>>> longest_common_prefix('police', 'policewoman')      # gives 6

The similarity between words is defined as the larger of the following:

\(s = 1 - \frac{\text{DL distance}}{\max(\mathrm{len}(word_1), \mathrm{len}(word_2))}\) and \(s = \frac{\text{length of longest common prefix}}{\max(\mathrm{len}(word_1), \mathrm{len}(word_2))}\)

>>> similarity('topology', 'topological')    # gives 0.6363636363636364
>>> similarity('book', 'blok')               # gives 0.75
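
As a hand check of the first example: the Damerau-Levenshtein distance between ‘topology’ and ‘topological’ is 4 (substitute y for i, then insert c, a, l), the longest common prefix has length 7, and the longer word has length 11, so the two candidate values coincide:

```python
# hand check of similarity('topology', 'topological')
s = max(1 - 4/11, 7/11)   # both candidates equal 7/11
print(s)                  # about 0.6364, matching the example above
```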

Given the similarity, we say that the intersection between ‘book’ and ‘blok’, for example, contains 0.75 elements, and hence their union contains 2 - 0.75 = 1.25 elements. The similarity between two sets of tokens can then be measured by the Jaccard index, using these “soft” counts for the intersection and the union. Therefore,

>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858
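
The whole computation can be sketched in pure Python. This is our own illustrative reimplementation, not the package’s code; it greedily matches the most similar token pairs (the package may use an optimal assignment instead, though the two agree on these examples), sums the matched similarities as the soft intersection, and takes the union as the total token count minus the intersection:

```python
from itertools import product

def dl_distance(w1, w2):
    # restricted Damerau-Levenshtein distance (dynamic programming)
    d = [[0] * (len(w2) + 1) for _ in range(len(w1) + 1)]
    for i in range(len(w1) + 1):
        d[i][0] = i
    for j in range(len(w2) + 1):
        d[0][j] = j
    for i in range(1, len(w1) + 1):
        for j in range(1, len(w2) + 1):
            cost = 0 if w1[i-1] == w2[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
            if i > 1 and j > 1 and w1[i-1] == w2[j-2] and w1[i-2] == w2[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[-1][-1]

def common_prefix_len(w1, w2):
    n = 0
    while n < min(len(w1), len(w2)) and w1[n] == w2[n]:
        n += 1
    return n

def word_similarity(w1, w2):
    # the larger of the two candidate values from the formula above
    maxlen = max(len(w1), len(w2))
    return max(1 - dl_distance(w1, w2) / maxlen,
               common_prefix_len(w1, w2) / maxlen)

def soft_jaccard(tokens1, tokens2):
    # greedily match the most similar token pairs, each token used at most once;
    # the matched similarities form the "soft" intersection
    pairs = sorted(((word_similarity(a, b), i, j)
                    for (i, a), (j, b) in product(enumerate(tokens1),
                                                  enumerate(tokens2))),
                   reverse=True)
    used1, used2, intersection = set(), set(), 0.0
    for s, i, j in pairs:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            intersection += s
    union = len(tokens1) + len(tokens2) - intersection
    return intersection / union

soft_jaccard(['book', 'seller'], ['blok', 'sellers'])   # about 0.6716, as above
```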

The functions damerau_levenshtein and longest_common_prefix are implemented using Cython. (Before release 0.7.2, they were interfaced to Python using SWIG, the Simplified Wrapper and Interface Generator.)

shorttext.metrics.dynprog.jaccard.similarity(word1, word2)

Return the similarity between the two words.

Return the similarity between the two words, between 0 and 1 inclusively. The similarity is the maximum of the two values:

  • 1 - (Damerau-Levenshtein distance between the two words) / (maximum length of the two words)
  • (length of the longest common prefix of the two words) / (maximum length of the two words)

Reference: Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]

Parameters:
  • word1 (str) – a word
  • word2 (str) – a word
Returns:

similarity, between 0 and 1 inclusively

Return type:

float

shorttext.metrics.dynprog.jaccard.soft_jaccard_score(tokens1, tokens2)

Return the soft Jaccard score of the two lists of tokens, between 0 and 1 inclusively.

Reference: Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]

Parameters:
  • tokens1 (list) – list of tokens.
  • tokens2 (list) – list of tokens.
Returns:

soft Jaccard score, between 0 and 1 inclusively.

Return type:

float

Word Mover’s Distance

Unlike the soft Jaccard score, which bases similarity on the words’ spellings, the Word Mover’s distance (WMD) is based on embedded word vectors. WMD is a special case of the Earth Mover’s distance (EMD), or Wasserstein distance. The calculation of WMD in this package is based on linear programming. The distance between words is the Euclidean distance by default (not the cosine distance), but the user can set it accordingly.

Import the modules, and load the word-embedding models:

>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')

Examples:

>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                      # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)      # gives 2.276337146759033
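
The linear program behind WMD can be sketched with toy two-dimensional “word vectors” (the vectors and the helper name toy_wmd below are invented for illustration; the package solves an analogous program with real embeddings). Each sentence is a uniform normalized bag of words, and we minimize the total cost of transporting one distribution onto the other:

```python
import numpy as np
from scipy.optimize import linprog

# toy word embeddings, for illustration only
vectors = {'cat': [0., 0.], 'kitten': [1., 0.],
           'dog': [0., 1.], 'puppy': [1., 1.]}

def toy_wmd(tokens1, tokens2):
    n1, n2 = len(tokens1), len(tokens2)
    w1 = np.full(n1, 1.0 / n1)   # uniform nBOW weights
    w2 = np.full(n2, 1.0 / n2)
    # cost matrix: Euclidean distances between embedded words
    cost = np.array([[np.linalg.norm(np.subtract(vectors[a], vectors[b]))
                      for b in tokens2] for a in tokens1])
    # flow variables x[i, j] >= 0, flattened row-major
    A_eq, b_eq = [], []
    for i in range(n1):                   # outgoing flow of each source word
        row = np.zeros(n1 * n2)
        row[i*n2:(i+1)*n2] = 1.0
        A_eq.append(row); b_eq.append(w1[i])
    for j in range(n2):                   # incoming flow of each target word
        row = np.zeros(n1 * n2)
        row[j::n2] = 1.0
        A_eq.append(row); b_eq.append(w2[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method='highs')
    return res.fun

toy_wmd(['cat', 'dog'], ['kitten', 'puppy'])   # gives 1.0 with these toy vectors
```

The optimal plan moves half the mass along cat-kitten and half along dog-puppy, each at unit cost, hence a distance of 1.0.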

More examples can be found in this IPython Notebook.

In gensim, the Word2Vec model allows the calculation of WMD if the user has installed the package PyEMD. It is based on the scale invariant feature transform (SIFT), an algorithm for EMD based on the L1-distance (Manhattan distance). For more details, please refer to their tutorial, and cite the two papers by Ofir Pele and Michael Werman if it is used.

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance(first_sent_tokens, second_sent_tokens, wvmodel, distancefunc=<function euclidean>, lpFile=None)

Compute the Word Mover’s distance (WMD) between the two given lists of tokens.

Using methods of linear programming, calculate the WMD between two lists of words. A word-embedding model has to be provided. WMD is returned.

Reference: Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).

Parameters:
  • first_sent_tokens (list) – first list of tokens.
  • second_sent_tokens (list) – second list of tokens.
  • wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding models.
  • distancefunc (function) – distance function that takes two numpy ndarray.
  • lpFile (str) – deprecated, kept for backward compatibility. (default: None)
Returns:

Word Mover’s distance (WMD)

Return type:

float

Jaccard Index Due to Cosine Distances

In the section on edit distance above, the Jaccard score was calculated with soft membership based on spelling. However, we can also compute the soft membership by cosine similarity:

>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents

For example, the size of the soft intersection between the set containing ‘doctor’ and that containing ‘physician’ is 0.78060223420956831 (according to the Google model), and therefore the Jaccard score is

\(0.78060223420956831 / (2-0.78060223420956831) = 0.6401538990056869\)

This can be verified by running it:

>>> jaccardscore_sents('doctor', 'physician', wvmodel)   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)   # gives 0.67588977344632573
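
The mechanics can be sketched with toy word vectors (the vectors below are invented for illustration; the package uses the loaded word-embedding model). For two one-word “sentences” with cosine similarity \(s\), the soft intersection is \(s\) and the soft union is \(1 + 1 - s\), giving the score \(s / (2 - s)\) as in the calculation above:

```python
import numpy as np

# toy word vectors, invented for illustration only
vectors = {'doctor': np.array([1.0, 0.0]),
           'physician': np.array([0.8, 0.6])}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

s = cosine(vectors['doctor'], vectors['physician'])   # 0.8 with these vectors
score = s / (2 - s)                                   # about 0.667
```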

shorttext.metrics.embedfuzzy.jaccard.jaccardscore_sents(sent1, sent2, wvmodel, sim_words=<function <lambda>>)

Compute the Jaccard score between sentences based on their word similarities.

Parameters:
  • sent1 (str) – first sentence
  • sent2 (str) – second sentence
  • wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding model
  • sim_words (function) – function for calculating the similarities between a pair of word vectors (default: cosine)
Returns:

soft Jaccard score

Return type:

float

BERTScore

BERTScore comprises a family of metrics based on the BERT model. These metrics measure the similarity between sentences. To use it,

>>> from shorttext.metrics.transformers import BERTScorer
>>> scorer = BERTScorer()    # using default BERT model and tokenizer
>>> scorer.recall_bertscore('The weather is cold.', 'It is freezing.')   # 0.7223385572433472
>>> scorer.precision_bertscore('The weather is cold.', 'It is freezing.')   # 0.7700849175453186
>>> scorer.f1score_bertscore('The weather is cold.', 'It is freezing.')   # 0.7454479746418043
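
Given the matrix of token-pair similarities (as returned by compute_matrix below), the three scores reduce to matching each token to its most similar counterpart. A sketch of that aggregation step with a made-up similarity matrix (rows: reference tokens, columns: test tokens; the real scorer derives the matrix from BERT embeddings, and this sketch omits any IDF weighting):

```python
import numpy as np

# hypothetical similarity matrix between 3 reference and 2 test tokens
simmatrix = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.3, 0.4]])

recall = simmatrix.max(axis=1).mean()       # each reference token's best match
precision = simmatrix.max(axis=0).mean()    # each test token's best match
f1 = 2 * recall * precision / (recall + precision)
```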

For BERT models, please refer to Word Embedding Models for more details.

class shorttext.metrics.transformers.bertscore.BERTScorer(model=None, tokenizer=None, max_length=48, nbencodinglayers=4, device='cpu')

This is the class that computes the BERTScores between sentences. BERTScores include recall BERTScores, precision BERTScores, and F1 BERTScores. For more information, please refer to this paper:

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv:1904.09675 (2019). [arXiv]

compute_matrix(sentence_a, sentence_b)

Compute the table of similarities between all pairs of tokens. This is used for calculating the BERTScores.

Parameters:
  • sentence_a (str) – first sentence
  • sentence_b (str) – second sentence
Returns:

similarity matrix of between tokens in two sentences

Return type:

numpy.ndarray

f1score_bertscore(reference_sentence, test_sentence)

Compute the F1 BERTScore between two sentences.

Parameters:
  • reference_sentence (str) – reference sentence
  • test_sentence (str) – test sentence
Returns:

F1 BERTScore between the two sentences

Return type:

float

precision_bertscore(reference_sentence, test_sentence)

Compute the precision BERTScore between two sentences.

Parameters:
  • reference_sentence (str) – reference sentence
  • test_sentence (str) – test sentence
Returns:

precision BERTScore between the two sentences

Return type:

float

recall_bertscore(reference_sentence, test_sentence)

Compute the recall BERTScore between two sentences.

Parameters:
  • reference_sentence (str) – reference sentence
  • test_sentence (str) – test sentence
Returns:

recall BERTScore between the two sentences

Return type:

float

Reference

“Damerau-Levenshtein Distance.” [Wikipedia]

“Jaccard index.” [Wikipedia]

Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350. (2014) [IEEE]

Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).

Ofir Pele, Michael Werman, “A linear time histogram metric for improved SIFT matching,” Computer Vision - ECCV 2008, 495-508 (2008). [ACM]

Ofir Pele, Michael Werman, “Fast and robust earth mover’s distances,” Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, 460-467 (2009). [IEEE]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv:1904.09675 (2019). [arXiv]

“Word Mover’s Distance as a Linear Programming Problem,” Everything About Data Analytics, WordPress (2017). [WordPress]
