Word-Embedding Cosine Similarity Classifier

Sum of Embedded Vectors

Given a pre-trained word-embedding model such as Word2Vec, a classifier based on cosine similarities can be built: shorttext.classifiers.SumEmbeddedVecClassifier. During training, the embedded vectors of all the words in each class are summed and normalized to a unit vector for that class (equivalent to averaging, since cosine similarity is insensitive to scale). The score of a given text for each class is the cosine similarity between the unit vector of the given text and the precalculated vector of that class.
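
This scheme can be sketched in a few lines of NumPy. The snippet below is illustrative only: a toy three-dimensional embedding stands in for a real Word2Vec model, and none of the names are part of shorttext.

import numpy as np

# toy three-dimensional "embeddings" standing in for a real Word2Vec model
toy_embeddings = {'cell': np.array([0.9, 0.1, 0.0]),
                  'gene': np.array([0.8, 0.3, 0.1]),
                  'virus': np.array([0.1, 0.9, 0.2])}

def text_to_unitvec(text):
    # sum the vectors of the in-vocabulary tokens and normalize to unit length
    vecs = [toy_embeddings[w] for w in text.lower().split() if w in toy_embeddings]
    if not vecs:
        return np.full(3, np.nan)
    summed = np.sum(vecs, axis=0)
    return summed / np.linalg.norm(summed)

# "training": one precalculated unit vector per class label
traindata = {'genetics': ['gene cell'], 'virology': ['virus cell']}
class_vectors = {label: text_to_unitvec(' '.join(texts))
                 for label, texts in traindata.items()}

# scoring: the cosine similarity of two unit vectors is their dot product
scores = {label: float(np.dot(text_to_unitvec('gene'), vec))
          for label, vec in class_vectors.items()}   # 'genetics' scores highest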

A pre-trained Google Word2Vec model can be downloaded here.

See: Word Embedding Models.

Import the package:

>>> import shorttext

To load the Word2Vec model,

>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')

Then we load a set of data:

>>> nihtraindata = shorttext.data.nihreports(sample_size=None)
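
The returned nihtraindata is a dictionary mapping each class label (an NIH institute code) to a list of short texts, and any dictionary of that shape can serve as training data. A hand-rolled example of the same shape (the labels and texts below are made up):

>>> toytraindata = {'NCI': ['cancer biology', 'tumor growth'],
...                 'NHGRI': ['genome sequencing', 'genetic variation']}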

Then initialize the classifier:

>>> classifier = shorttext.classifiers.SumEmbeddedVecClassifier(wvmodel)   # the vector size (300 for the Google model) is inferred from the given model
>>> classifier.train(nihtraindata)

This classifier takes relatively little time to train compared with others in this package. Then we can perform classification:

>>> classifier.score('bioinformatics')

Or the results can be sorted so that only the five top-scoring classes are displayed:

>>> sorted(classifier.score('stem cell research').items(), key=lambda item: item[1], reverse=True)[:5]
[('NIGMS', 0.44962596182682935),
 ('NIAID', 0.4494126990050461),
 ('NINDS', 0.43435236806719524),
 ('NIDCR', 0.43042338197002483),
 ('NHGRI', 0.42878346869968731)]
>>> sorted(classifier.score('bioinformatics').items(), key=lambda item: item[1], reverse=True)[:5]
[('NHGRI', 0.54200061864847038),
 ('NCATS', 0.49097267547279988),
 ('NIGMS', 0.47818129591411118),
 ('CIT', 0.46874987052158501),
 ('NLM', 0.46869259072562974)]
>>> sorted(classifier.score('cancer immunotherapy').items(), key=lambda item: item[1], reverse=True)[:5]
[('NCI', 0.53734097785976076),
 ('NIAID', 0.50616582142027433),
 ('NIDCR', 0.48596330887674788),
 ('NIDDK', 0.46875755765903215),
 ('NCCAM', 0.4642233792198418)]
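
The sorting boilerplate above can be wrapped in a small helper; top_scores below is a name introduced here for illustration and is not part of shorttext:

>>> def top_scores(classifier, text, k=5):
...     # rank the (label, score) pairs by score and keep the k largest
...     return sorted(classifier.score(text).items(), key=lambda item: item[1], reverse=True)[:k]
...
>>> top_scores(classifier, 'stem cell research')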

The trained model can be saved:

>>> classifier.save_compact_model('/path/to/sumvec_nihdata_model.bin')

And with the same pre-trained Word2Vec model, this classifier can be loaded:

>>> classifier2 = shorttext.classifiers.load_sumword2vec_classifier(wvmodel, '/path/to/sumvec_nihdata_model.bin')

class shorttext.classifiers.embed.sumvec.SumEmbedVecClassification.SumEmbeddedVecClassifier(wvmodel, vecsize=None, simfcn=<function SumEmbeddedVecClassifier.<lambda>>)

This is a supervised classification algorithm for short text categorization. Each class label has a few short sentences, where each token is converted to an embedded vector given by a pre-trained word-embedding model (e.g., the Google Word2Vec model). These vectors are summed and normalized to a unit vector for that particular class label. To perform prediction, the input short sentence is converted to a unit vector in the same way. The similarity score is calculated by cosine similarity.


loadmodel(nameprefix)

Load a trained model from files.

Given the prefix of the file paths, load the model from files with name given by the prefix followed by “_embedvecdict.pkl”.

If neither this method nor train() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.

Parameters: nameprefix (str) – prefix of the file path
Returns: None
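
For example, assuming a model was previously saved with savemodel() under the prefix /path/to/nihdata (as in the appendix below), it can be reloaded into an already-constructed classifier:

>>> classifier.loadmodel('/path/to/nihdata')
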
savemodel(nameprefix)

Save the trained model into files.

Given the prefix of the file paths, save the model into files with name given by the prefix followed by “_embedvecdict.pkl”. If there is no trained model, a ModelNotTrainedException will be raised.

Parameters: nameprefix (str) – prefix of the file path
Returns: None
Raises: ModelNotTrainedException

score(shorttext)

Calculate the scores for all the class labels for the given short sentence.

Given a short sentence, calculate the classification scores for all class labels, returned as a dictionary with key being the class labels, and values being the scores. If the short sentence is empty, or if other numerical errors occur, the score will be numpy.nan.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – a short sentence
Returns: a dictionary with keys being the class labels, and values being the corresponding classification scores
Return type: dict
Raises: ModelNotTrainedException
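
Because an empty or entirely out-of-vocabulary input yields numpy.nan scores, a caller may want to filter those out, for example:

>>> import math
>>> scores = classifier.score('bioinformatics')
>>> valid_scores = {label: s for label, s in scores.items() if not math.isnan(s)}
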
shorttext_to_embedvec(shorttext)

Convert the short text into an averaged embedded vector representation.

Given a short sentence, it converts all the tokens into embedded vectors according to the given word-embedding model, sums them up, and normalizes the resulting vector to unit length. It returns the vector that represents the short sentence.

Parameters: shorttext (str) – a short sentence
Returns: an embedded vector that represents the short sentence
Return type: numpy.ndarray
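
The conversion can be approximated in plain NumPy against a gensim KeyedVectors model such as wvmodel above. This is a sketch of the described behaviour, not the library's exact code; in particular, the whitespace tokenization here is a simplifying assumption:

import numpy as np

def sum_embed(wvmodel, shorttext):
    # embed every in-vocabulary token, sum the vectors, and normalize to unit length
    vecs = [wvmodel[token] for token in shorttext.lower().split() if token in wvmodel]
    if not vecs:
        return np.full(wvmodel.vector_size, np.nan)   # mirrors the numpy.nan behaviour noted above
    summed = np.sum(vecs, axis=0)
    return summed / np.linalg.norm(summed)
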
train(classdict)

Train the classifier.

If this has not been run, or a model was not loaded by loadmodel(), a ModelNotTrainedException will be raised while performing prediction or saving the model.

Parameters: classdict (dict) – training data
Returns: None
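
Any dictionary mapping class labels to lists of short texts can be passed in; for instance, with a made-up two-class dataset:

>>> classifier.train({'greeting': ['hello there', 'good morning'],
...                   'farewell': ['goodbye', 'see you later']})
>>> classifier.score('good evening')   # a dict of cosine similarities, one per label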

Appendix: Model I/O in Previous Versions

In previous versions of shorttext, shorttext.classifiers.SumEmbeddedVecClassifier has a savemodel method, which runs as follows:

>>> classifier.savemodel('/path/to/nihdata')

This produces the following file for this model:

/path/to/nihdata_embedvecdict.pkl

It can be loaded by:

>>> classifier2 = shorttext.classifiers.load_sumword2vec_classifier(wvmodel, '/path/to/nihdata', compact=False)
