Supervised Classification with Topics as Features
Topic Vectors as Intermediate Feature Vectors
To perform classification using the bag-of-words (BOW) model as features, nltk and gensim offer good frameworks. But the BOW feature vectors of short texts can be very sparse, and the relationships between words with similar meanings are ignored. One way to tackle this is topic modeling, i.e., representing each text as a topic vector. This package provides the following ways to model the topics:
- LDA (Latent Dirichlet Allocation)
- LSI (Latent Semantic Indexing)
- RP (Random Projections)
- Autoencoder
With the topic representations, users can use any supervised learning algorithm provided by scikit-learn to perform the classification.
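To make the sparsity problem concrete, here is a minimal sketch in plain Python. The three texts are made up for illustration, and the helper names `bow_vector` and `cosine` are hypothetical; they are not part of shorttext, nltk, or gensim.

```python
from collections import Counter
from math import sqrt


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Toy texts (illustrative only): the first two are related in meaning
# but share no words; the first and third share two words.
texts = ["stem cell therapy", "regenerative medicine", "stem cell research"]
vocabulary = sorted({word for text in texts for word in text.split()})
vectors = [bow_vector(text, vocabulary) for text in texts]

print(vocabulary)
print(vectors)                           # mostly zeros, even for a tiny vocabulary
print(cosine(vectors[0], vectors[1]))    # 0.0: related texts, no word overlap
print(cosine(vectors[0], vectors[2]))    # positive: two shared words
```

With BOW features alone, the first two texts look completely unrelated. A topic representation is meant to place such texts closer together.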
Topic Models in gensim: LDA, LSI, and Random Projections
This package supports three algorithms provided by gensim, namely, LDA, LSI, and Random Projections, to do the topic modeling.
>>> import shorttext
First, load a set of training data (all NIH data in this example):
>>> trainclassdict = shorttext.data.nihreports(sample_size=None)
Initialize an instance of the topic modeler, using LDA as an example:
>>> topicmodeler = shorttext.generators.LDAModeler()
For other algorithms, users can use LSIModeler for LSI or RPModeler for RP. Everything else is the same.
To train with 128 topics, enter:
>>> topicmodeler.train(trainclassdict, 128)
After the training is done, the user can retrieve the topic vector representation with the trained model. For example,
>>> topicmodeler.retrieve_topicvec('stem cell research')
>>> topicmodeler.retrieve_topicvec('bioinformatics')
By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:
>>> topicmodeler['stem cell research']
>>> topicmodeler['bioinformatics']
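The text above does not say which norm is used for normalization. As a sketch, assuming the common choice of Euclidean (L2) normalization, a normalized vector is the raw vector divided by its length. The function name `l2_normalize` is hypothetical and not part of shorttext.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (assumed norm; see lead-in)."""
    norm = sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


raw = [3.0, 0.0, 4.0]      # made-up topic weights
unit = l2_normalize(raw)
print(unit)                # [0.6, 0.0, 0.8]
```

Normalized vectors make cosine similarities easy to compare, since the score then depends only on direction and not on text length.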
In the training and the retrieval above, the same preprocessing is applied. Users can provide their own preprocessor when initializing the topic modeler.
Users can save the trained model by calling:
>>> topicmodeler.save_compact_model('/path/to/nihlda128.bin')
And the topic model can be retrieved by calling:
>>> topicmodeler2 = shorttext.generators.load_gensimtopicmodel('/path/to/nihlda128.bin')
When initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency - inverse document frequency). The default is to weigh. To disable weighing, initialize it as
>>> topicmodeler3 = shorttext.generators.GensimTopicModeler(toweigh=False)
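For readers unfamiliar with tf-idf, the sketch below shows one common formulation: term frequency multiplied by log2(N / document frequency). This is an illustration only; the exact weighting and normalization applied when toweigh=True are determined by gensim and may differ. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
from collections import Counter
from math import log2


def tfidf_weights(documents):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weighted = []
    for doc in documents:
        term_freq = Counter(doc)
        weighted.append(
            {word: tf * log2(n_docs / doc_freq[word]) for word, tf in term_freq.items()}
        )
    return weighted


docs = [
    ["stem", "cell", "research"],
    ["cell", "biology"],
    ["cell", "phone", "research"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

A word that occurs in every document ("cell" here) gets weight zero, while a rarer word ("stem") gets the largest weight.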
Appendix: Model I/O in Previous Versions
In previous versions of shorttext, trained models were saved by calling:
>>> topicmodeler.savemodel('/path/to/nihlda128')
However, we discourage using this method, because the model I/O differs among the various gensim models and can produce errors.
All of the files written by savemodel have to be present for the model to be loaded. Note that the preprocessor is not saved. To load the model, enter:
>>> topicmodeler2 = shorttext.classifiers.load_gensimtopicmodel('/path/to/nihlda128', compact=False)
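Because the non-compact format consists of several files that share one prefix, it can help to check that they are all present before loading. The sketch below uses the suffixes named in the loadmodel and savemodel descriptions on this page. It assumes each suffix is appended directly to the prefix; the helper names are hypothetical and not part of shorttext.

```python
from pathlib import Path

# Suffixes taken from the loadmodel/savemodel descriptions on this page.
# Assumption: each suffix is appended directly to the prefix.
REQUIRED_SUFFIXES = [".json", ".gensimdict", ".gensimmodel"]
TFIDF_SUFFIX = ".gensimtfidf"  # only relevant when tf-idf weighing is applied


def expected_model_files(nameprefix, toweigh=True):
    """List the file paths a non-compact gensim topic model is expected to use."""
    suffixes = REQUIRED_SUFFIXES + ([TFIDF_SUFFIX] if toweigh else [])
    return [Path(nameprefix + suffix) for suffix in suffixes]


def missing_model_files(nameprefix, toweigh=True):
    """Return the expected files that do not exist on disk."""
    return [path for path in expected_model_files(nameprefix, toweigh) if not path.exists()]


print([path.name for path in expected_model_files("/path/to/nihlda128")])
```

Calling `missing_model_files` before loading gives a clearer message than a failure halfway through the load.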
class shorttext.generators.bow.GensimTopicModeling.GensimTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, algorithm='lda', toweigh=True, normalize=True)
This class facilitates the creation of topic models (options: LDA (latent Dirichlet allocation), LSI (latent semantic indexing), and Random Projections) with the given short text training data, and converts future short texts into topic vectors using the trained topic model.
No compact model I/O is available for this class. Refer to LDAModeler and LSIModeler.
This class extends LatentTopicModeler.
get_batch_cos_similarities(shorttext)
Calculate the score, which is the cosine similarity with the topic vector of the model, of the short text against each class label.
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raise: ModelNotTrainedException
Return type: dict
loadmodel(nameprefix)
Load the topic model with the given prefix of the file paths.
Given the prefix of the file paths, load the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).
Parameters: nameprefix (str) – prefix of the file paths
Returns: None
retrieve_corpus_topicdist(shorttext)
Calculate the topic vector representation of the short text, in the corpus form.
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – text to be represented
Returns: topic vector in the corpus form
Raise: ModelNotTrainedException
Return type: list
retrieve_topicvec(shorttext)
Calculate the topic vector representation of the short text.
This function calls retrieve_corpus_topicdist().
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – text to be represented
Returns: topic vector
Raise: ModelNotTrainedException
Return type: numpy.ndarray
savemodel(nameprefix)
Save the model with names according to the prefix.
Given the prefix of the file paths, save the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, save also the tf-idf model (.gensimtfidf).
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters: nameprefix (str) – prefix of the file paths
Returns: None
Raise: ModelNotTrainedException
train(classdict, nb_topics, *args, **kwargs)
Train the topic modeler.
Parameters:
- classdict (dict) – training data
- nb_topics (int) – number of latent topics
- args – arguments to pass to the train method for gensim topic models
- kwargs – arguments to pass to the train method for gensim topic models
Returns: None
update(additional_classdict)
Update the topic model with additional data.
Warning: This does not allow adding class labels or new words. The dictionary is not changed, so such an update alters only the topic model, which affects the topic vector representation. While the corpus is changed, the words used in calculating the similarity matrix are not changed.
This function is therefore meant for a fast update. For a comprehensive model, retraining is recommended.
Parameters: additional_classdict (dict) – additional training data
Returns: None
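The consequence of keeping the dictionary fixed can be sketched in plain Python. This is an illustration of the general idea only, not shorttext's or gensim's implementation; the helpers `build_dictionary` and `to_bow` and the toy data are hypothetical.

```python
from collections import Counter


def build_dictionary(classdict):
    """Map each word in the training data to an integer id."""
    words = sorted(
        {word for texts in classdict.values() for text in texts for word in text.split()}
    )
    return {word: idx for idx, word in enumerate(words)}


def to_bow(text, dictionary):
    """Return sorted (word_id, count) pairs; words missing from the dictionary are dropped."""
    counts = Counter(word for word in text.split() if word in dictionary)
    return sorted((dictionary[word], count) for word, count in counts.items())


original = {"physics": ["path integral", "quantum field"]}
additional = {"physics": ["quantum gravity", "path integral"]}

dictionary = build_dictionary(original)  # fixed from here on, as in update()
for text in additional["physics"]:
    print(text, "->", to_bow(text, dictionary))
```

The new word "gravity" never reaches the model because it has no id in the existing dictionary, which is why retraining is the safer choice when the vocabulary has changed.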
class shorttext.generators.bow.GensimTopicModeling.LDAModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)
This class facilitates the creation of LDA (latent Dirichlet allocation) topic models with the given short text training data, and converts future short texts into topic vectors using the trained topic model.
This class extends GensimTopicModeler.
class shorttext.generators.bow.GensimTopicModeling.LSIModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)
This class facilitates the creation of LSI (latent semantic indexing) topic models with the given short text training data, and converts future short texts into topic vectors using the trained topic model.
This class extends GensimTopicModeler.
class shorttext.generators.bow.GensimTopicModeling.RPModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)
This class facilitates the creation of RP (random projection) topic models with the given short text training data, and converts future short texts into topic vectors using the trained topic model.
This class extends GensimTopicModeler.
shorttext.generators.bow.GensimTopicModeling.load_gensimtopicmodel(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)
Load the gensim topic modeler from files.
Parameters:
- name (str) – name (if compact=True) or prefix (if compact=False) of the file path
- preprocessor (function) – function that preprocesses the text. (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
- compact (bool) – whether model file is compact (Default: True)
Returns: a topic modeler
Return type:
AutoEncoder
Note: Previous versions (<= 0.2.1) of this autoencoder have a serious bug. The current version is incompatible with the autoencoder of versions <= 0.2.1.
Another way to find a new topic vector representation is to use an autoencoder, a neural network model that compresses a vector representation into another, shorter one (or, rarely, a longer one) by minimizing the difference between the input layer and the decoding layer. For faster demonstration, use the subject keywords as the example dataset:
>>> subdict = shorttext.data.subjectkeywords()
To train such a model, proceed in the same way as with the LDA model (or LSI and random projections) above:
>>> autoencoder = shorttext.generators.AutoencodingTopicModeler()
>>> autoencoder.train(subdict, 8)
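To illustrate what "minimizing the difference between the input layer and the decoding layer" means, the sketch below encodes a 4-dimensional vector into 2 dimensions, decodes it back, and measures the mean squared reconstruction error. The weights are hand-picked for illustration; a real autoencoder (such as the keras model used by this package) learns them during training and usually has nonlinear layers. The function names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def reconstruction_error(x, encoder, decoder):
    """Encode x, decode it back, and return (code, mean squared error)."""
    code = matvec(encoder, x)
    x_hat = matvec(decoder, code)
    error = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return code, error


# Hand-picked linear weights (illustrative only): 4 input dims -> 2 code dims.
encoder = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]
decoder = [[1.0, 0.0],
           [1.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]

code, err = reconstruction_error([1.0, 1.0, 0.0, 0.0], encoder, decoder)
print(code, err)    # this input is reconstructed perfectly

code2, err2 = reconstruction_error([1.0, 0.0, 0.0, 0.0], encoder, decoder)
print(code2, err2)  # this input loses information in the 2-dim code
```

The 2-dimensional code plays the role of the topic vector; training adjusts the weights so that the reconstruction error over the training data is as small as possible.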
After the training is done, the user can retrieve the encoded vector representation with the trained autoencoder model. For example,
>>> autoencoder.retrieve_topicvec('linear algebra')
>>> autoencoder.retrieve_topicvec('path integral')
By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:
>>> autoencoder['linear algebra']
>>> autoencoder['path integral']
In the training and the retrieval above, the same preprocessing is applied. Users can provide their own preprocessor when initializing the topic modeler.
Users can save the trained model by calling:
>>> autoencoder.save_compact_model('/path/to/sub_autoencoder8.bin')
And the model can be retrieved by calling:
>>> autoencoder2 = shorttext.generators.load_autoencoder_topicmodel('/path/to/sub_autoencoder8.bin')
As with the other topic models, when initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency - inverse document frequency). The default is to weigh. To disable weighing, initialize it as:
>>> autoencoder3 = shorttext.generators.AutoencodingTopicModeler(toweigh=False)
class shorttext.generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True)
This class facilitates the topic modeling of input training data using the autoencoder.
For a reference on how an autoencoder is written with keras, see the article by Francois Chollet titled Building Autoencoders in Keras.
This class extends LatentTopicModeler.
get_batch_cos_similarities(shorttext)
Calculate the score, which is the cosine similarity with the topic vector of the model, of the short text against each class label.
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raise: ModelNotTrainedException
Return type: dict
loadmodel(nameprefix, load_incomplete=False)
Load the model with names according to the prefix.
Given the prefix of the file paths, load the model from files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).
Parameters:
- nameprefix (str) – prefix of the paths of the file
- load_incomplete (bool) – load encoder only, not decoder and autoencoder file (Default: False; put True for model built in version <= 0.2.1)
Returns: None
precalculate_liststr_topicvec(shorttexts)
Calculate the summed topic vectors for training data for each class.
This function is called while training.
Parameters: shorttexts (list) – list of short texts
Returns: average topic vector
Raise: ModelNotTrainedException
Return type: numpy.ndarray
retrieve_topicvec(shorttext)
Calculate the topic vector representation of the short text.
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – short text
Returns: encoded vector representation of the short text
Raise: ModelNotTrainedException
Return type: numpy.ndarray
savemodel(nameprefix, save_complete_autoencoder=True)
Save the model with names according to the prefix.
Given the prefix of the file paths, save the model into files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).
If save_complete_autoencoder is True, then there are also files with names ending with “_decoder.json” and “_decoder.h5”.
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
Parameters:
- nameprefix (str) – prefix of the paths of the file
- save_complete_autoencoder (bool) – whether to store the decoder and the complete autoencoder (Default: True; but False for version <= 0.2.1)
Returns: None
train(classdict, nb_topics, *args, **kwargs)
Train the autoencoder.
Parameters:
- classdict (dict) – training data
- nb_topics (int) – number of topics, i.e., the number of encoding dimensions
- args – arguments to be passed to keras model fitting
- kwargs – arguments to be passed to keras model fitting
Returns: None
shorttext.generators.bow.AutoEncodingTopicModeling.load_autoencoder_topicmodel(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)
Load the autoencoding topic model from files.
Parameters:
- name (str) – name (if compact=True) or prefix (if compact=False) of the paths of the model files
- preprocessor (function) – function that preprocesses the text. (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
- compact (bool) – whether model file is compact (Default: True)
Returns: an autoencoder as a topic modeler
Return type: generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler
Appendix: Unzipping Model I/O
In previous versions of shorttext, trained models were saved by calling:
>>> autoencoder.savemodel('/path/to/sub_autoencoder8')
The following files are produced for the autoencoder:
/path/to/sub_autoencoder8.json
/path/to/sub_autoencoder8.gensimdict
/path/to/sub_autoencoder8_encoder.json
/path/to/sub_autoencoder8_encoder.h5
/path/to/sub_autoencoder8_classtopicvecs.pkl
If save_complete_autoencoder=True is specified, four more files are produced:
/path/to/sub_autoencoder8_decoder.json
/path/to/sub_autoencoder8_decoder.h5
/path/to/sub_autoencoder8_autoencoder.json
/path/to/sub_autoencoder8_autoencoder.h5
Users can load the same model later by entering:
>>> autoencoder2 = shorttext.classifiers.load_autoencoder_topic('/path/to/sub_autoencoder8', compact=False)
Abstract Latent Topic Modeling Class
Both shorttext.generators.GensimTopicModeler and shorttext.generators.AutoencodingTopicModeler extend shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler, which is virtually an abstract class. Users who want to develop their own topic model by extending this class have to define the methods train, retrieve_topicvec, loadmodel, and savemodel.
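The pattern can be sketched without importing shorttext. In the example below, `LatentTopicModelerSketch` is a hypothetical stand-in for the abstract base class (it is not the package's code, and it raises Python's built-in NotImplementedError rather than the package's own exception), and `TopWordsModeler` is a toy subclass whose "topics" are simply the most frequent training words. The JSON file layout is likewise an assumption made for this sketch.

```python
import json
from collections import Counter


class LatentTopicModelerSketch:
    """Hypothetical stand-in for the abstract base class; not shorttext's code."""

    def train(self, classdict, nb_topics, *args, **kwargs):
        raise NotImplementedError()

    def retrieve_topicvec(self, shorttext):
        raise NotImplementedError()

    def loadmodel(self, nameprefix):
        raise NotImplementedError()

    def savemodel(self, nameprefix):
        raise NotImplementedError()

    def __getitem__(self, shorttext):
        # Mirrors the modeler['some text'] shortcut shown earlier.
        return self.retrieve_topicvec(shorttext)


class TopWordsModeler(LatentTopicModelerSketch):
    """Toy modeler: each 'topic' is one of the nb_topics most frequent words."""

    def __init__(self):
        self.topic_words = []

    def train(self, classdict, nb_topics, *args, **kwargs):
        counts = Counter(
            word
            for texts in classdict.values()
            for text in texts
            for word in text.lower().split()
        )
        # Sort by descending count, then alphabetically, for a deterministic result.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        self.topic_words = [word for word, _ in ranked[:nb_topics]]

    def retrieve_topicvec(self, shorttext):
        words = shorttext.lower().split()
        return [float(words.count(word)) for word in self.topic_words]

    def savemodel(self, nameprefix):
        with open(nameprefix + ".json", "w") as f:
            json.dump({"topic_words": self.topic_words}, f)

    def loadmodel(self, nameprefix):
        with open(nameprefix + ".json") as f:
            self.topic_words = json.load(f)["topic_words"]


toy = {"mathematics": ["linear algebra", "abstract algebra"], "physics": ["path integral"]}
modeler = TopWordsModeler()
modeler.train(toy, 2)
print(modeler.topic_words)
print(modeler["linear algebra"])
```

A real subclass would inherit from LatentTopicModeler itself and would typically reuse its preprocessing and dictionary helpers, such as generate_corpus() and retrieve_bow().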
class shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True)
Abstract class for various topic modelers.
generate_corpus(classdict)
Calculate the gensim dictionary and corpus, and extract the class labels from the training data. Called by train().
Parameters: classdict (dict) – training data
Returns: None
get_batch_cos_similarities(shorttext)
Calculate the cosine similarities of the given short text and all the class labels.
This is an abstract method of this abstract class, which raises NotImplementedException.
Parameters: shorttext (str) – short text
Returns: topic vector
Raise: NotImplementedException
Return type: numpy.ndarray
loadmodel(nameprefix)
Load the model from files.
This is an abstract method of this abstract class, which raises NotImplementedException.
Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raise: NotImplementedException
retrieve_bow(shorttext)
Calculate the gensim bag-of-words representation of the given short text.
Parameters: shorttext (str) – text to be represented
Returns: corpus representation of the text
Return type: list
retrieve_bow_vector(shorttext, normalize=True)
Calculate the vector representation of the bag-of-words in terms of numpy.ndarray.
Parameters:
- shorttext (str) – short text
- normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
Returns: vector representation of the text
Return type: numpy.ndarray
retrieve_topicvec(shorttext)
Calculate the topic vector representation of the short text.
This is an abstract method of this abstract class, which raises NotImplementedException.
Parameters: shorttext (str) – short text
Returns: topic vector
Raise: NotImplementedException
Return type: numpy.ndarray
savemodel(nameprefix)
Save the model to files.
This is an abstract method of this abstract class, which raises NotImplementedException.
Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raise: NotImplementedException
train(classdict, nb_topics, *args, **kwargs)
Train the modeler.
This is an abstract method of this abstract class, which raises NotImplementedException.
Parameters:
- classdict (dict) – training data
- nb_topics (int) – number of latent topics
- args – arguments to be passed into the wrapped training functions
- kwargs – arguments to be passed into the wrapped training functions
Returns: None
Raise: NotImplementedException
Appendix: Namespaces for Topic Modeler in Previous Versions
All generative topic modeling algorithms were placed under the package shorttext.classifiers for versions <= 0.3.4. In the current version (>= 0.3.5), however, all generative models have been moved to shorttext.generators, while any classifiers making use of these topic models are still kept under shorttext.classifiers. The list includes:
shorttext.classifiers.GensimTopicModeler -> shorttext.generators.GensimTopicModeler
shorttext.classifiers.LDAModeler -> shorttext.generators.LDAModeler
shorttext.classifiers.LSIModeler -> shorttext.generators.LSIModeler
shorttext.classifiers.RPModeler -> shorttext.generators.RPModeler
shorttext.classifiers.AutoencodingTopicModeler -> shorttext.generators.AutoencodingTopicModeler
shorttext.classifiers.load_gensimtopicmodel -> shorttext.generators.load_gensimtopicmodel
shorttext.classifiers.load_autoencoder_topic -> shorttext.generators.load_autoencoder_topicmodel
Before release 0.5.6, for backward compatibility, developers could still call the topic models as if there were no such changes, although they were advised to make this change. As of release 0.5.7, however, this backward compatibility is no longer available.
Classification Using Cosine Similarity
The topic modelers are trained to represent the short text in terms of a topic vector, effectively the feature vector. However, supervised classification also needs a classification algorithm. The first approach is to calculate the cosine similarities between the topic vector of the given short text and those of the texts in each class label.
If there is already a trained topic modeler, whether it is shorttext.generators.GensimTopicModeler or shorttext.generators.AutoencodingTopicModeler, a classifier based on cosine similarities can be initialized immediately without training. Taking the LDA example above, such a classifier can be initialized as follows:
>>> cos_classifier = shorttext.classifiers.TopicVectorCosineDistanceClassifier(topicmodeler)
Or if the user already saved the topic modeler, one can initiate the same classifier by loading the topic modeler:
>>> cos_classifier = shorttext.classifiers.load_gensimtopicvec_cosineClassifier('/path/to/nihlda128.bin')
To perform prediction, enter:
>>> cos_classifier.score('stem cell research')
which outputs a dictionary with labels and the corresponding scores.
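The scoring step can be sketched in plain Python. This is an illustration of cosine-similarity scoring in general, not the package's implementation: it assumes each class is summarized by a single topic vector, and the labels, vectors, and function names are made up.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def score(text_vec, class_vecs):
    """Return {label: cosine similarity between the text and that class}."""
    return {label: cosine(text_vec, vec) for label, vec in class_vecs.items()}


# Hypothetical 2-topic class vectors and a text vector (illustrative values).
class_vecs = {
    "mathematics": [1.0, 0.0],
    "physics": [0.0, 1.0],
    "theology": [1.0, 1.0],
}
scores = score([3.0, 4.0], class_vecs)
print(scores)
```

The result has the same shape as the dictionary returned by score(): one entry per class label, with larger values meaning a closer match.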
The same applies to the autoencoder, except that a classifier based on an autoencoder is loaded by another function:
>>> cos_classifier = shorttext.classifiers.load_autoencoder_cosineClassifier('/path/to/sub_autoencoder8.bin')
class shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.TopicVecCosineDistanceClassifier(topicmodeler)
This is a class that implements a classifier that performs classification based on the cosine similarity between the topic vectors of the user-input short texts and various classes. The topic vectors are calculated using LatentTopicModeler.
loadmodel(nameprefix)
Load the topic model with the given prefix of the file paths.
Given the prefix of the file paths, load the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).
This is essentially loading the topic modeler LatentTopicModeler.
Parameters: nameprefix (str) – prefix of the file paths
Returns: None
savemodel(nameprefix)
Save the model with names according to the prefix.
Given the prefix of the file paths, save the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, save also the tf-idf model (.gensimtfidf).
If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.
This is essentially saving the topic modeler LatentTopicModeler.
Parameters: nameprefix (str) – prefix of the file paths
Returns: None
Raise: ModelNotTrainedException
score(shorttext)
Calculate the score, which is the cosine similarity with the topic vector of the model, of the short text against each class label.
Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Return type: dict
shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_autoencoder_cosineClassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)
Load an autoencoder from files for topic modeling, and return a cosine classifier.
Given the prefix of the file paths, load the model from files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).
Parameters:
- name (str) – name (if compact=True) or prefix (if compact=False) of the file paths
- preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
- compact (bool) – whether model file is compact (Default: True)
Returns: a classifier that scores the short text based on the autoencoder
Return type:
shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_gensimtopicvec_cosineClassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)
Load a gensim topic model from files and return a cosine distance classifier.
Given the prefix of the files of the topic model, return a cosine distance classifier based on this model, i.e., TopicVecCosineDistanceClassifier.
The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).
Parameters:
- name (str) – name (if compact=True) or prefix (if compact=False) of the file paths
- preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
- compact (bool) – whether model file is compact (Default: True)
Returns: a classifier that scores the short text based on the topic model
Return type:
shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_autoencoder_cosineClassifier(classdict, nb_topics, preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True, *args, **kwargs)
Return a cosine distance classifier, i.e., TopicVecCosineDistanceClassifier, while training an autoencoder as a topic model in between.
Parameters:
- classdict (dict) – training data
- nb_topics (int) – number of topics, i.e., number of encoding dimensions
- preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
- normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
- args – arguments to be passed to keras model fitting
- kwargs – arguments to be passed to keras model fitting
Returns: a classifier that scores the short text based on the autoencoder
Return type:
shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_gensimtopicvec_cosineClassifier(classdict, nb_topics, preprocessor=<function text_preprocessor.<locals>.<lambda>>, algorithm='lda', toweigh=True, normalize=True, *args, **kwargs)
Return a cosine distance classifier, i.e., TopicVecCosineDistanceClassifier, while training a gensim topic model in between.
Parameters:
- classdict (dict) – training data
- nb_topics (int) – number of latent topics
- preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
- algorithm (str) – algorithm for topic modeling. Options: lda, lsi, rp. (Default: lda)
- toweigh (bool) – whether to weigh the words using tf-idf. (Default: True)
- normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
- args – arguments to pass to the train method for gensim topic models
- kwargs – arguments to pass to the train method for gensim topic models
Returns: a classifier that scores the short text based on the topic model
Return type:
Classification Using Scikit-Learn Classifiers
The topic modeler can be used to generate features for other machine learning algorithms. Any supervised learning algorithm in scikit-learn can be used here; we use Gaussian naive Bayes as an example. For faster demonstration, use the subject keywords as the example dataset.
>>> subtopicmodeler = shorttext.generators.LDAModeler()
>>> subtopicmodeler.train(subdict, 8)
We first import the class:
>>> from sklearn.naive_bayes import GaussianNB
And we train the classifier:
>>> classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(subtopicmodeler, GaussianNB())
>>> classifier.train(subdict)
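Conceptually, a wrapper like this has to turn the class dictionary into feature vectors and labels before fitting the scikit-learn estimator. The sketch below illustrates that flattening step in plain Python. It assumes the training data maps each class label to a list of short texts; `build_training_set` and `toy_vectorize` are hypothetical names, and `toy_vectorize` merely stands in for a trained topic modeler. It is not the library's implementation.

```python
def build_training_set(classdict, vectorize):
    """Flatten {label: [texts]} into parallel lists of features and label indices."""
    labels = sorted(classdict)
    features, targets = [], []
    for idx, label in enumerate(labels):
        for text in classdict[label]:
            features.append(vectorize(text))
            targets.append(idx)
    return labels, features, targets


def toy_vectorize(text):
    # Stand-in for a topic modeler: [number of words, number of characters].
    return [len(text.split()), len(text)]


# Hypothetical toy data, for illustration only.
toy_classdict = {
    'physics': ['quantum field theory', 'functional integral'],
    'mathematics': ['linear algebra'],
}
labels, X, y = build_training_set(toy_classdict, toy_vectorize)
```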
Predictions can be performed like the following example:
>>> classifier.score('functional integral')
which outputs a dictionary with labels and the corresponding scores.
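Since the result is an ordinary dictionary, the best label can be picked with standard Python. A minimal sketch, using made-up scores rather than real classifier output (`top_label` is a hypothetical helper):

```python
def top_label(scores):
    """Return the label with the highest score, or None for an empty dictionary."""
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hypothetical scores, not actual output of the classifier.
example_scores = {'mathematics': 0.12, 'physics': 0.81, 'theology': 0.07}
best = top_label(example_scores)
```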
You can save the model by:
>>> classifier.save_compact_model('/path/to/sublda8nb.bin')
where the argument specifies the path of the compact model file, which holds both the topic model and the scikit-learn model. The classifier can be loaded by calling:
>>> classifier2 = shorttext.classifiers.load_gensim_topicvec_sklearnclassifier('/path/to/sublda8nb.bin')
The topic modeler here can also be an autoencoder; passing an autoencoder topic modeler in place of subtopicmodeler works in the same way. However, to load a saved classifier that uses an autoencoder model, call
>>> classifier2 = shorttext.classifiers.load_autoencoder_topic_sklearnclassifier('/path/to/filename.bin')
Compact model files saved by TopicVectorSkLearnClassifier in shorttext >= 1.0.0 cannot be read by earlier versions of shorttext. The reverse is not true: old compact model files can still be read by newer versions.
class shorttext.classifiers.bow.topic.SkLearnClassification.TopicVectorSkLearnClassifier(topicmodeler, sklearn_classifier)¶
This classifier wraps any supervised learning algorithm in scikit-learn and uses, as features, the topic vectors output by the topic modeler LatentTopicModeler, which wraps the topic models in gensim.
# Reference
Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).
Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ‘08 Proceedings of the 17th international conference on World Wide Web. (2008) [ACL]
classify(shorttext)¶
Give the highest-scoring class of the given short text according to the classifier.
If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – short text
Returns: class label of the classification result of the given short text
Raise: ModelNotTrainedException
Return type: str
getvector(shorttext)¶
Retrieve the topic vector representation of the given short text.
If the topic modeler does not have a trained model, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – short text
Returns: topic vector representation
Raise: ModelNotTrainedException
Return type: numpy.ndarray
load_compact_model(name)¶
Load the classification model together with the topic model from a compact file.
Parameters: name (str) – name of the compact model file
Returns: None
loadmodel(nameprefix)¶
Load the classification model together with the topic model.
Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
save_compact_model(name)¶
Save the model.
Save the topic model and the trained scikit-learn classification model in one compact model file.
If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.
Parameters: name (str) – name of the compact model file
Returns: None
savemodel(nameprefix)¶
Save the model.
Save the topic model and the trained scikit-learn classification model. The scikit-learn model will have the name nameprefix followed by the extension .pkl. The topic model files are the same as those saved by LatentTopicModeler.
If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.
Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raise: ModelNotTrainedException
score(shorttext)¶
Calculate the score of the short text against each class label, which is the cosine similarity with the topic vector of the model.
If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.
Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raise: ModelNotTrainedException
Return type: dict
train(classdict, *args, **kwargs)¶
Train the classifier.
If the topic modeler does not have a trained model, it will raise ModelNotTrainedException.
Parameters: - classdict (dict) – training data
- args – arguments to be passed to the fit method of the scikit-learn classifier
- kwargs – arguments to be passed to the fit method of the scikit-learn classifier
Returns: None
Raise: ModelNotTrainedException
shorttext.classifiers.bow.topic.SkLearnClassification.load_autoencoder_topic_sklearnclassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)¶
Load the classifier, a wrapper that uses a scikit-learn classifier with feature vectors given by an autoencoder topic model, from files.
Parameters: - name (str) – name (if compact==True) or prefix (if compact==False) of the paths of model files
- preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
- compact (bool) – whether model file is compact (Default: True)
Returns: a trained classifier
Return type: TopicVectorSkLearnClassifier
shorttext.classifiers.bow.topic.SkLearnClassification.load_gensim_topicvec_sklearnclassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)¶
Load the classifier, a wrapper that uses a scikit-learn classifier with feature vectors given by a topic model, from files.
Parameters: - name (str) – name (if compact==True) or prefix (if compact==False) of the paths of model files
- preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
- compact (bool) – whether model file is compact (Default: True)
Returns: a trained classifier
Return type: TopicVectorSkLearnClassifier
shorttext.classifiers.bow.topic.SkLearnClassification.train_autoencoder_topic_sklearnclassifier(classdict, nb_topics, sklearn_classifier, preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True, keras_paramdict={}, sklearn_paramdict={})¶
Train the supervised learning classifier, with features given by topic vectors.
It trains an autoencoder topic model and then, using its encoded vector representation as features, trains a supervised learning classifier. An instantiated (not yet trained) scikit-learn classifier must be passed as the sklearn_classifier argument.
Parameters: - classdict (dict) – training data
- nb_topics (int) – number of topics, i.e., number of encoding dimensions
- sklearn_classifier (sklearn.base.BaseEstimator) – instantiated scikit-learn classifier
- preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
- normalize (bool) – whether the retrieved topic vectors are normalized (Default: True)
- keras_paramdict – arguments to be passed to keras for training autoencoder
- sklearn_paramdict – arguments to be passed to scikit-learn for fitting the classifier
Returns: a trained classifier
Return type: TopicVectorSkLearnClassifier
shorttext.classifiers.bow.topic.SkLearnClassification.train_gensim_topicvec_sklearnclassifier(classdict, nb_topics, sklearn_classifier, preprocessor=<function text_preprocessor.<locals>.<lambda>>, topicmodel_algorithm='lda', toweigh=True, normalize=True, gensim_paramdict={}, sklearn_paramdict={})¶
Train the supervised learning classifier, with features given by topic vectors.
It trains a topic model and then, using its topic vector representation as features, trains a supervised learning classifier. An instantiated (not yet trained) scikit-learn classifier must be passed as the sklearn_classifier argument.
Parameters: - classdict (dict) – training data
- nb_topics (int) – number of topics in the topic model
- sklearn_classifier (sklearn.base.BaseEstimator) – instantiated scikit-learn classifier
- preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
- topicmodel_algorithm (str) – topic model algorithm (Default: ‘lda’)
- toweigh (bool) – whether to weigh the words using tf-idf (Default: True)
- normalize (bool) – whether the retrieved topic vectors are normalized (Default: True)
- gensim_paramdict (dict) – arguments to be passed on to the train method of the gensim topic model
- sklearn_paramdict (dict) – arguments to be passed on to the fit method of the sklearn classification algorithm
Returns: a trained classifier
Return type: TopicVectorSkLearnClassifier
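As a sketch of how the parameters above fit together, the helper below trains an LDA topic model and a Gaussian naive Bayes classifier in one call. `train_lda_naive_bayes` is a hypothetical wrapper; the import path is taken from the module name documented above and may differ between shorttext versions. The imports sit inside the function so that defining it does not require the packages to be installed.

```python
def train_lda_naive_bayes(classdict, nb_topics=8):
    """Train an LDA topic model, then a Gaussian naive Bayes classifier on its topic vectors."""
    # Local imports: shorttext and scikit-learn are only needed when the function is called.
    from sklearn.naive_bayes import GaussianNB
    from shorttext.classifiers.bow.topic.SkLearnClassification import (
        train_gensim_topicvec_sklearnclassifier,
    )

    return train_gensim_topicvec_sklearnclassifier(
        classdict,
        nb_topics,
        GaussianNB(),                # instantiated, not yet fitted
        topicmodel_algorithm='lda',  # the documented default
        toweigh=True,                # weigh words with tf-idf
        normalize=True,              # normalize the retrieved topic vectors
    )
```

Under these assumptions, calling `train_lda_naive_bayes(subdict)` should give a classifier comparable to the one built step by step earlier in this section.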
Notes about Text Preprocessing¶
The topic models are based on the bag-of-words model, so text preprocessing is very important. However, the text preprocessing step cannot be serialized, so users should keep track of it on their own. Unless it is necessary to do otherwise, use the standard preprocessing.
See more: Text Preprocessing.
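One way to keep track of a custom preprocessing step is to define it as a named function in a module of your own, so that the same function can be passed as the preprocessor argument both when training and when loading a model. The sketch below assumes a preprocessor takes a string and returns a string; `simple_preprocessor` and its stop-word list are hypothetical and are not the package's standard preprocessor.

```python
import re

# Illustrative stop words only; this is not the library's list.
STOPWORDS = frozenset({'a', 'an', 'and', 'of', 'the', 'to'})


def simple_preprocessor(text):
    """Lowercase, replace non-letter characters with spaces, and drop a few stop words."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return ' '.join(tokens)


cleaned = simple_preprocessor('The Theory of Stem-Cell Research!')
```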
Reference¶
David M. Blei, “Probabilistic Topic Models,” Communications of the ACM 55(4): 77-84 (2012).
Francois Chollet, “Building Autoencoders in Keras,” The Keras Blog. [Keras]
Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).
Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ‘08 Proceedings of the 17th international conference on World Wide Web. (2008) [ACL]