Word-Embedding Cosine Similarity Classifier
Sum of Embedded Vectors
Given a pre-trained word-embedding model such as Word2Vec, a classifier
based on cosine similarities can be built. It is implemented as
shorttext.classifiers.SumEmbeddedVecClassifier.
During training,
the embedded vectors of all words in the texts of each class are averaged into a single class vector. The
score of a given text for each class is the cosine similarity between the averaged
vector of that text and the precalculated vector of the class.
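The scoring scheme described above can be illustrated with a minimal sketch. It uses NumPy and a tiny hand-made embedding table instead of a real Word2Vec model. The names toy_vectors, embed, cosine, train and score are hypothetical and only serve this illustration. They are not the shorttext implementation.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up values); a real model such as
# Word2Vec has hundreds of dimensions.
toy_vectors = {
    "gene":   np.array([1.0, 0.1, 0.0]),
    "genome": np.array([0.9, 0.2, 0.1]),
    "tumor":  np.array([0.1, 1.0, 0.0]),
    "cancer": np.array([0.0, 0.9, 0.2]),
}

def embed(text):
    """Average the vectors of the words of a text that are in the vocabulary."""
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not vecs:
        return np.zeros(3)  # no known word: fall back to the zero vector
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def train(classdict):
    """Compute one averaged vector per class over all words of its texts."""
    return {label: embed(" ".join(texts)) for label, texts in classdict.items()}

def score(class_vectors, text):
    """Cosine similarity between the text vector and every class vector."""
    v = embed(text)
    return {label: cosine(v, c) for label, c in class_vectors.items()}

class_vectors = train({"genetics": ["gene genome"], "oncology": ["tumor cancer"]})
scores = score(class_vectors, "cancer gene cancer")
print(max(scores, key=scores.get))  # oncology
```

Because only the direction of the vectors matters for cosine similarity, a text dominated by words close to one class vector scores highest for that class, regardless of the text length.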
A pre-trained Google Word2Vec model can be downloaded from https://code.google.com/archive/p/word2vec/.
See: Word Embedding Models.
Import the package:
>>> import shorttext
To load the Word2Vec model,
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')
Then we load a set of data:
>>> nihtraindata = shorttext.data.nihreports(sample_size=None)
Then initialize the classifier:
>>> classifier = shorttext.classifiers.SumEmbeddedVecClassifier(wvmodel)  # vecsize defaults to None and is read from the model (300 for the Google model)
>>> classifier.train(nihtraindata)
This classifier takes relatively little time to train compared with others in this package. Then we can perform classification:
>>> classifier.score('bioinformatics')
The result can also be sorted so that only the five top-scored classes are displayed:
>>> sorted(classifier.score('stem cell research').items(), key=lambda item: item[1], reverse=True)[:5]
[('NIGMS', 0.44962596182682935),
('NIAID', 0.4494126990050461),
('NINDS', 0.43435236806719524),
('NIDCR', 0.43042338197002483),
('NHGRI', 0.42878346869968731)]
>>> sorted(classifier.score('bioinformatics').items(), key=lambda item: item[1], reverse=True)[:5]
[('NHGRI', 0.54200061864847038),
('NCATS', 0.49097267547279988),
('NIGMS', 0.47818129591411118),
('CIT', 0.46874987052158501),
('NLM', 0.46869259072562974)]
>>> sorted(classifier.score('cancer immunotherapy').items(), key=lambda item: item[1], reverse=True)[:5]
[('NCI', 0.53734097785976076),
('NIAID', 0.50616582142027433),
('NIDCR', 0.48596330887674788),
('NIDDK', 0.46875755765903215),
('NCCAM', 0.4642233792198418)]
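The sorting expression above can be wrapped in a small helper when it is needed repeatedly. The following is a sketch only: top_k is a hypothetical name, not part of shorttext, and the scores in example_scores are made-up values standing in for the dictionary returned by classifier.score. It uses heapq.nlargest from the standard library, which avoids sorting the whole dictionary when only a few labels are wanted.

```python
import heapq

def top_k(scores, k=5):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Made-up scores with placeholder labels, for illustration only.
example_scores = {"A": 0.12, "B": 0.57, "C": 0.33, "D": 0.48}

print(top_k(example_scores, k=2))   # [('B', 0.57), ('D', 0.48)]

# The single best label is the first element of the top-1 list.
best_label, best_score = top_k(example_scores, k=1)[0]
```

With a trained classifier, the same helper would presumably be called as top_k(classifier.score('stem cell research')).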
The trained model can be saved:
>>> classifier.save_compact_model('/path/to/sumvec_nihdata_model.bin')
And with the same pre-trained Word2Vec model, this classifier can be loaded:
>>> classifier2 = shorttext.classifiers.SumEmbeddedVecClassifier.from_pretrained(wvmodel, '/path/to/sumvec_nihdata_model.bin')
- class shorttext.classifiers.embed.sumvec.SumEmbedVecClassification.SumEmbeddedVecClassifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, simfcn: callable | None = None)
Bases: CompactIOMachine
Classifier using summed word embeddings.
Each class is represented as the sum of word embeddings for its training sentences, normalized to a unit vector. Prediction uses cosine similarity between the input vector and class centroids.
- Reference:
Pre-trained Word2Vec: https://code.google.com/archive/p/word2vec/
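The description above speaks of a normalized sum, while the tutorial text speaks of an average. For cosine similarity the two give the same result when applied to the same set of vectors, because normalization removes the 1/n factor. The sketch below checks this numerically with NumPy on random stand-in vectors. The names unit, summed and averaged are hypothetical, and the snippet illustrates the general property only. It makes no claim about the order of operations inside the library.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(7, 300))  # 7 random stand-in word vectors, 300 dimensions

def unit(v):
    """Scale a non-zero vector to unit length."""
    return v / np.linalg.norm(v)

summed = word_vectors.sum(axis=0)
averaged = word_vectors.mean(axis=0)

# The sum and the mean differ only by the factor 1/7, so they share a direction.
same_direction = np.allclose(unit(summed), unit(averaged))

# For unit vectors, cosine similarity reduces to a plain dot product.
query = unit(rng.normal(size=300))
cos_dot = float(query @ unit(summed))
cos_full = float(query @ summed / (np.linalg.norm(query) * np.linalg.norm(summed)))

print(same_direction, np.isclose(cos_dot, cos_full))  # True True
```

Storing class vectors as unit vectors therefore lets each score be computed with a single dot product against a normalized input vector.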
- __init__(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, simfcn: callable | None = None)
Initialize the classifier.
- Args:
wvmodel: Word embedding model (e.g., Word2Vec).
vecsize: Vector size. Default: None (extracted from model).
simfcn: Similarity function. Default: cosine_similarity.
- train(classdict: dict[str, list[str]]) → None
Train the classifier.
- Args:
classdict: Training data with class labels as keys and texts as values.
- Raises:
ModelNotTrainedException: If not trained or loaded.
- savemodel(nameprefix: str) → None
Save the trained model.
- Args:
nameprefix: Prefix for output files.
- Raises:
ModelNotTrainedException: If not trained.
- loadmodel(nameprefix: str) → None
Load a trained model.
- Args:
nameprefix: Prefix for input files.
- shorttext_to_embedvec(shorttext: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D Array']
Convert short text to embedding vector.
- Args:
shorttext: Input text.
- Returns:
Normalized embedding vector.
- score(shorttext: str) → dict[str, float]
Calculate classification scores for all class labels.
- Args:
shorttext: Input text.
- Returns:
Dictionary mapping class labels to scores.
- Raises:
ModelNotTrainedException: If not trained.
- classmethod from_pretrained(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → Self
Load a SumEmbeddedVecClassifier from file.
- Args:
wvmodel: Word embedding model.
name: Model name (compact) or prefix (non-compact).
compact: Whether to load compact model. Default: True.
vecsize: Vector size. Default: None.
- Returns:
SumEmbeddedVecClassifier instance.
Appendix: Model I/O in Previous Versions
In previous versions of shorttext, shorttext.classifiers.SumEmbeddedVecClassifier models were saved with the savemodel method,
which is used as follows:
>>> classifier.savemodel('/path/to/nihdata')
This produces the following file for this model:
/path/to/nihdata_embedvecdict.pkl
It can be loaded by:
>>> classifier2 = shorttext.classifiers.SumEmbeddedVecClassifier.from_pretrained(wvmodel, '/path/to/nihdata', compact=False)
Reference
Michael Czerny, “Modern Methods for Sentiment Analysis,” District Data Labs (2015). [DistrictDataLabs]