Supervised Classification with Topics as Features

Topic Vectors as Intermediate Feature Vectors

To perform classification using the bag-of-words (BOW) model as features, nltk and gensim offer good frameworks. But the feature vectors of short texts represented by BOW can be very sparse, and the relationships between words with similar meanings are ignored. One way to tackle this is topic modeling, i.e. representing the words in terms of a topic vector. This package provides the following ways to model the topics:

  • LDA (Latent Dirichlet Allocation)
  • LSI (Latent Semantic Indexing)
  • RP (Random Projections)
  • Autoencoder

With the topic representations, users can use any supervised learning algorithm provided by scikit-learn to perform the classification.
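The sparsity problem of BOW features can be illustrated with a few lines of plain Python. This is only a sketch on made-up texts with whitespace tokenization; it does not use the preprocessing of this package.

```python
# Toy illustration of why bag-of-words vectors of short texts are sparse.
texts = ["stem cell research", "tumor biology study", "cancer cell growth"]

# Build the vocabulary from whitespace-separated tokens.
vocabulary = sorted({token for text in texts for token in text.split()})

def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.split()
    return [tokens.count(word) for word in vocabulary]

vectors = [bow_vector(text) for text in texts]
for text, vector in zip(texts, vectors):
    print(text, vector)

# Fraction of zero entries across all vectors.
zeros = sum(value == 0 for vector in vectors for value in vector)
sparsity = zeros / (len(vectors) * len(vocabulary))

# "tumor" and "cancer" are related, yet they occupy unrelated dimensions,
# so their BOW vectors have no overlap at all.
overlap = sum(a * b for a, b in zip(bow_vector("tumor"), bow_vector("cancer")))
print(sparsity, overlap)
```

Most entries are zero even for this tiny vocabulary, and related words share no dimension. A topic vector is meant to address both issues.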

Topic Models in gensim: LDA, LSI, and Random Projections

This package supports three algorithms provided by gensim, namely, LDA, LSI, and Random Projections, to do the topic modeling.

>>> import shorttext

First, load a set of training data (all NIH data in this example):

>>> trainclassdict = shorttext.data.nihreports(sample_size=None)
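The training data is a dictionary (see the classdict parameter of train() below). As a sketch, assuming the keys are class labels and the values are lists of short texts, a hand-made dataset of the same shape could look like this; the labels, the texts, and the helper summarize are made up for illustration.

```python
# Hypothetical hand-made training data: a dict that maps each class label
# to a list of short texts (labels and texts are made up).
toy_classdict = {
    "mathematics": ["linear algebra", "number theory", "topology"],
    "physics": ["path integral", "quantum mechanics", "general relativity"],
}

def summarize(classdict):
    """Return the number of short texts per class label."""
    return {label: len(texts) for label, texts in classdict.items()}

print(summarize(toy_classdict))
```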

Initialize an instance of the topic modeler, using LDA as an example:

>>> topicmodeler = shorttext.generators.LDAModeler()

For other algorithms, use LSIModeler for LSI or RPModeler for RP; everything else is the same. To train with 128 topics, enter:

>>> topicmodeler.train(trainclassdict, 128)

After the training is done, the user can retrieve the topic vector representation with the trained model. For example,

>>> topicmodeler.retrieve_topicvec('stem cell research')
>>> topicmodeler.retrieve_topicvec('bioinformatics')

By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:

>>> topicmodeler['stem cell research']
>>> topicmodeler['bioinformatics']
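What normalization does to a vector can be sketched as follows, assuming it means scaling to unit Euclidean (L2) length. The helper l2_normalize and the numbers are hypothetical and not part of the package.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; leave a zero vector unchanged."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        return list(vector)
    return [value / norm for value in vector]

raw_topicvec = [3.0, 0.0, 4.0]   # made-up topic weights
unit_topicvec = l2_normalize(raw_topicvec)
print(unit_topicvec)
```

With unit-length vectors, the dot product of two topic vectors equals their cosine similarity.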

In the training and the retrieval above, the same preprocessing is applied. Users can provide their own preprocessor when initializing the topic modeler.

Users can save the trained model by calling:

>>> topicmodeler.save_compact_model('/path/to/nihlda128.bin')

And the topic model can be retrieved by calling:

>>> topicmodeler2 = shorttext.generators.load_gensimtopicmodel('/path/to/nihlda128.bin')

When initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency - inverse document frequency). The default is to weigh. To not weigh, initialize it as:

>>> topicmodeler3 = shorttext.generators.GensimTopicModeler(toweigh=False)
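The idea of tf-idf weighing can be sketched in plain Python. The formula below (term count times the natural logarithm of N over the document frequency) is a textbook variant chosen for illustration; the weighing actually applied by the package may use a different logarithm base and normalization.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Weigh each term count by log(N / document frequency)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted.append({term: count * math.log(n_docs / doc_freq[term])
                         for term, count in counts.items()})
    return weighted

docs = ["cell research", "cell biology", "cell growth study"]   # made-up texts
weights = tfidf_weights(docs)
print(weights[0])
```

A term that occurs in every document, such as "cell" here, receives zero weight, while rarer terms are emphasized.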

Appendix: Model I/O in Previous Versions

In previous versions of shorttext, trained models were saved by calling:

>>> topicmodeler.savemodel('/path/to/nihlda128')

However, this is now discouraged, because the model I/O differs between the various gensim models and calling it can produce errors.

All of the files produced have to be present for the model to be loaded. Note that the preprocessor is not saved. To load the model, enter:

>>> topicmodeler2 = shorttext.classifiers.load_gensimtopicmodel('/path/to/nihlda128', compact=False)

class shorttext.generators.bow.GensimTopicModeling.GensimTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, algorithm='lda', toweigh=True, normalize=True)

This class facilitates the creation of topic models (options: LDA (latent Dirichlet allocation), LSI (latent semantic indexing), and random projections) with the given short text training data, and converts future short texts into topic vectors using the trained topic model.

No compact model I/O available for this class. Refer to LDAModeler and LSIModeler.

This class extends LatentTopicModeler.

get_batch_cos_similarities(shorttext)

Calculate the score of the short text against each class label, where the score is the cosine similarity with the topic vector of the model.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raises: ModelNotTrainedException
Return type: dict

loadmodel(nameprefix)

Load the topic model with the given prefix of the file paths.

Given the prefix of the file paths, load the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).
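The file naming described above can be sketched as follows. The helper gensim_model_files is hypothetical and only concatenates the prefix with the suffixes listed in the text; it does not read or write any file.

```python
def gensim_model_files(nameprefix, toweigh=True):
    """List the file paths that belong to one non-compact gensim topic model."""
    suffixes = [".json", ".gensimdict", ".gensimmodel"]
    if toweigh:
        # The tf-idf model is only present when weighing is applied.
        suffixes.append(".gensimtfidf")
    return [nameprefix + suffix for suffix in suffixes]

print(gensim_model_files("/path/to/nihlda128"))
print(gensim_model_files("/path/to/nihlda128", toweigh=False))
```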

Parameters: nameprefix (str) – prefix of the file paths
Returns: None

retrieve_corpus_topicdist(shorttext)

Calculate the topic vector representation of the short text, in the corpus form.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – text to be represented
Returns: topic vector in the corpus form
Raises: ModelNotTrainedException
Return type: list

retrieve_topicvec(shorttext)

Calculate the topic vector representation of the short text.

This function calls retrieve_corpus_topicdist().

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – text to be represented
Returns: topic vector
Raises: ModelNotTrainedException
Return type: numpy.ndarray

savemodel(nameprefix)

Save the model with names according to the prefix.

Given the prefix of the file paths, save the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, the tf-idf model (.gensimtfidf) is saved as well.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: nameprefix (str) – prefix of the file paths
Returns: None
Raises: ModelNotTrainedException

train(classdict, nb_topics, *args, **kwargs)

Train the topic modeler.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of latent topics
  • args – arguments to pass to the train method for gensim topic models
  • kwargs – arguments to pass to the train method for gensim topic models
Returns:

None

update(additional_classdict)

Update the model with additional data.

Warning: This does not allow adding new class labels or new words. The dictionary is not changed, so such an update alters the topic model only, which affects the topic vector representation. While the corpus is changed, the words used in calculating the similarity matrix are not.

This function is therefore meant for fast updates. For a comprehensive model, retraining is recommended.
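The consequence of an unchanged dictionary can be sketched in plain Python. The function to_bow below is a hypothetical stand-in that mimics how a fixed token-to-id mapping drops unknown words; it is not the code used by the package.

```python
def to_bow(text, token2id):
    """Map a text to (token id, count) pairs using a fixed dictionary."""
    counts = {}
    for token in text.split():
        if token in token2id:            # unknown words are silently dropped
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

# Dictionary built at training time; it is not extended afterwards.
token2id = {"stem": 0, "cell": 1, "research": 2}

print(to_bow("stem cell research", token2id))
print(to_bow("crispr cell editing", token2id))   # only "cell" is kept
```

Words that first appear in the additional data therefore contribute nothing until the model is retrained from scratch.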

Parameters: additional_classdict (dict) – additional training data
Returns: None

class shorttext.generators.bow.GensimTopicModeling.LDAModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)

This class facilitates the creation of LDA (latent Dirichlet allocation) topic models with the given short text training data, and converts future short texts into topic vectors using the trained topic model.

This class extends GensimTopicModeler.

class shorttext.generators.bow.GensimTopicModeling.LSIModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)

This class facilitates the creation of LSI (latent semantic indexing) topic models with the given short text training data, and converts future short texts into topic vectors using the trained topic model.

This class extends GensimTopicModeler.

class shorttext.generators.bow.GensimTopicModeling.RPModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)

This class facilitates the creation of RP (random projection) topic models with the given short text training data, and converts future short texts into topic vectors using the trained topic model.

This class extends GensimTopicModeler.

shorttext.generators.bow.GensimTopicModeling.load_gensimtopicmodel(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load the gensim topic modeler from files.

Parameters:
  • name (str) – name (if compact=True) or prefix (if compact=False) of the file path
  • preprocessor (function) – function that preprocesses the text. (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
  • compact (bool) – whether model file is compact (Default: True)
Returns:

a topic modeler

Return type:

GensimTopicModeler

AutoEncoder

Note: Previous versions (<= 0.2.1) of this autoencoder have a serious bug. The current version is incompatible with the autoencoder of versions <= 0.2.1.

Another way to find a new topic vector representation is to use an autoencoder, a neural network model that compresses a vector representation into another one of shorter (or, rarely, longer) length by minimizing the difference between the input layer and the decoding layer. For faster demonstration, use the subject keywords as the example dataset:

>>> subdict = shorttext.data.subjectkeywords()

To train such a model, proceed in the same way as with the LDA model (or LSI and random projections) above:

>>> autoencoder = shorttext.generators.AutoencodingTopicModeler()
>>> autoencoder.train(subdict, 8)
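To make the idea of an autoencoder concrete, here is a minimal sketch of a linear autoencoder trained by stochastic gradient descent in plain Python. It is not the keras model used by this package; the data, the layer sizes, and the learning rate are made up for illustration.

```python
import random

random.seed(0)

# Four made-up 4-dimensional input vectors (e.g. tiny bag-of-words rows).
data = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.1],
    [0.0, 0.0, 1.0, 1.0],
    [0.1, 0.0, 0.9, 1.0],
]
n_in, n_code = 4, 2

# Encoder (n_code x n_in) and decoder (n_in x n_code) weights, small random init.
enc = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_code)]
dec = [[random.uniform(-0.5, 0.5) for _ in range(n_code)] for _ in range(n_in)]

def encode(x):
    """Compress an input vector into the shorter code (the 'topic vector')."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in enc]

def decode(code):
    """Reconstruct an input-sized vector from the code."""
    return [sum(w * ci for w, ci in zip(row, code)) for row in dec]

def reconstruction_error():
    """Mean squared difference between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        total += sum((r - xi) ** 2 for r, xi in zip(decode(encode(x)), x))
    return total / len(data)

initial_error = reconstruction_error()
learning_rate = 0.05
for _ in range(2000):
    for x in data:
        code = encode(x)
        diff = [r - xi for r, xi in zip(decode(code), x)]   # output - input
        # Gradient of the squared error with respect to the code,
        # computed before the decoder weights are changed.
        grad_code = [sum(dec[i][j] * diff[i] for i in range(n_in))
                     for j in range(n_code)]
        for i in range(n_in):
            for j in range(n_code):
                dec[i][j] -= learning_rate * diff[i] * code[j]
        for j in range(n_code):
            for k in range(n_in):
                enc[j][k] -= learning_rate * grad_code[j] * x[k]

final_error = reconstruction_error()
print(initial_error, final_error)
print(encode(data[0]))   # the 2-dimensional encoded representation
```

The output of encode plays the role of the topic vector: a shorter representation from which the input can be approximately reconstructed.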

After the training is done, the user can retrieve the encoded vector representation with the trained autoencoder model. For example,

>>> autoencoder.retrieve_topicvec('linear algebra')
>>> autoencoder.retrieve_topicvec('path integral')

By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:

>>> autoencoder['linear algebra']
>>> autoencoder['path integral']

In the training and the retrieval above, the same preprocessing is applied. Users can provide their own preprocessor when initializing the topic modeler.

Users can save the trained model by calling:

>>> autoencoder.save_compact_model('/path/to/sub_autoencoder8.bin')

And the model can be retrieved by calling:

>>> autoencoder2 = shorttext.generators.load_autoencoder_topicmodel('/path/to/sub_autoencoder8.bin')

Like other topic models, when initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency - inverse document frequency). The default is to weigh. To not weigh, initialize it as:

>>> autoencoder3 = shorttext.generators.AutoencodingTopicModeler(toweigh=False)

class shorttext.generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True)

This class facilitates the topic modeling of input training data using the autoencoder.

For a reference on how an autoencoder is written with keras, see "Building Autoencoders in Keras" by Francois Chollet.

This class extends LatentTopicModeler.

get_batch_cos_similarities(shorttext)

Calculate the score of the short text against each class label, where the score is the cosine similarity with the topic vector of the model.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raises: ModelNotTrainedException
Return type: dict

loadmodel(nameprefix, load_incomplete=False)

Load the model with names according to the prefix.

Given the prefix of the file paths, load the model from files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).

Parameters:
  • nameprefix (str) – prefix of the paths of the file
  • load_incomplete (bool) – load encoder only, not decoder and autoencoder file (Default: False; put True for model built in version <= 0.2.1)
Returns:

None

precalculate_liststr_topicvec(shorttexts)

Calculate the summed topic vectors for training data for each class.

This function is called while training.

Parameters: shorttexts (list) – list of short texts
Returns: average topic vector
Raises: ModelNotTrainedException
Return type: numpy.ndarray

retrieve_topicvec(shorttext)

Calculate the topic vector representation of the short text.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: encoded vector representation of the short text
Raises: ModelNotTrainedException
Return type: numpy.ndarray

savemodel(nameprefix, save_complete_autoencoder=True)

Save the model with names according to the prefix.

Given the prefix of the file paths, save the model into files, with name given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).

If save_complete_autoencoder is True, then there are also files with names ending with “_decoder.json” and “_decoder.h5”.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters:
  • nameprefix (str) – prefix of the paths of the file
  • save_complete_autoencoder (bool) – whether to store the decoder and the complete autoencoder (Default: True; but False for version <= 0.2.1)
Returns:

None

train(classdict, nb_topics, *args, **kwargs)

Train the autoencoder.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of topics, i.e., the number of encoding dimensions
  • args – arguments to be passed to keras model fitting
  • kwargs – arguments to be passed to keras model fitting
Returns:

None

shorttext.generators.bow.AutoEncodingTopicModeling.load_autoencoder_topicmodel(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load the autoencoding topic model from files.

Parameters:
  • name (str) – name (if compact=True) or prefix (if compact=False) of the paths of the model files
  • preprocessor (function) – function that preprocesses the text. (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
  • compact (bool) – whether model file is compact (Default: True)
Returns:

an autoencoder as a topic modeler

Return type:

generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler

Appendix: Unzipping Model I/O

For previous versions of shorttext, the trained models are saved by calling:

>>> autoencoder.savemodel('/path/to/sub_autoencoder8')

The following files are produced for the autoencoder:

/path/to/sub_autoencoder8.json
/path/to/sub_autoencoder8.gensimdict
/path/to/sub_autoencoder8_encoder.json
/path/to/sub_autoencoder8_encoder.h5
/path/to/sub_autoencoder8_classtopicvecs.pkl

If save_complete_autoencoder=True is specified, four more files are produced:

/path/to/sub_autoencoder8_decoder.json
/path/to/sub_autoencoder8_decoder.h5
/path/to/sub_autoencoder8_autoencoder.json
/path/to/sub_autoencoder8_autoencoder.h5

Users can load the same model later by entering:

>>> autoencoder2 = shorttext.classifiers.load_autoencoder_topic('/path/to/sub_autoencoder8', compact=False)

Abstract Latent Topic Modeling Class

Both shorttext.generators.GensimTopicModeler and shorttext.generators.AutoencodingTopicModeler extend shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler, which is virtually an abstract class. Users who want to develop their own topic model by extending it have to define the methods train, retrieve_topicvec, loadmodel, and savemodel.
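The pattern of extending such an abstract class can be sketched as follows. To keep the example self-contained, LatentTopicModelerSketch is a hypothetical stand-in for the real base class, Python's built-in NotImplementedError is used in place of the package's exception, and the toy modeler ClassVocabularyModeler is made up for illustration.

```python
class LatentTopicModelerSketch:
    """Hypothetical stand-in for the abstract base class, not the real one."""

    def train(self, classdict, nb_topics, *args, **kwargs):
        raise NotImplementedError()

    def retrieve_topicvec(self, shorttext):
        raise NotImplementedError()

    def loadmodel(self, nameprefix):
        raise NotImplementedError()

    def savemodel(self, nameprefix):
        raise NotImplementedError()


class ClassVocabularyModeler(LatentTopicModelerSketch):
    """Toy modeler: one 'topic' per class, scored by shared words."""

    def train(self, classdict, nb_topics, *args, **kwargs):
        # nb_topics is ignored here; the toy model uses one topic per class.
        self.labels = sorted(classdict)
        self.vocab = {label: {word for text in classdict[label]
                              for word in text.split()}
                      for label in self.labels}

    def retrieve_topicvec(self, shorttext):
        tokens = shorttext.split()
        return [float(sum(token in self.vocab[label] for token in tokens))
                for label in self.labels]

    def loadmodel(self, nameprefix):
        raise NotImplementedError("persistence is left out of this sketch")

    def savemodel(self, nameprefix):
        raise NotImplementedError("persistence is left out of this sketch")


modeler = ClassVocabularyModeler()
modeler.train({"mathematics": ["linear algebra"], "physics": ["path integral"]}, 2)
print(modeler.retrieve_topicvec("linear path integral"))
```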

class shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True)

Abstract class for various topic modelers.

generate_corpus(classdict)

Calculate the gensim dictionary and corpus, and extract the class labels from the training data. Called by train().

Parameters: classdict (dict) – training data
Returns: None

get_batch_cos_similarities(shorttext)

Calculate the cosine similarities of the given short text and all the class labels.

This is an abstract method of this abstract class, which raises NotImplementedException.

Parameters: shorttext (str) – short text
Returns: topic vector
Raises: NotImplementedException
Return type: numpy.ndarray

loadmodel(nameprefix)

Load the model from files.

This is an abstract method of this abstract class, which raises NotImplementedException.

Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raises: NotImplementedException

retrieve_bow(shorttext)

Calculate the gensim bag-of-words representation of the given short text.

Parameters: shorttext (str) – text to be represented
Returns: corpus representation of the text
Return type: list

retrieve_bow_vector(shorttext, normalize=True)

Calculate the vector representation of the bag-of-words in terms of numpy.ndarray.

Parameters:
  • shorttext (str) – short text
  • normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
Returns:

vector representation of the text

Return type:

numpy.ndarray

retrieve_topicvec(shorttext)

Calculate the topic vector representation of the short text.

This is an abstract method of this abstract class, which raises NotImplementedException.

Parameters: shorttext (str) – short text
Returns: topic vector
Raises: NotImplementedException
Return type: numpy.ndarray

savemodel(nameprefix)

Save the model to files.

This is an abstract method of this abstract class, which raises NotImplementedException.

Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raises: NotImplementedException

train(classdict, nb_topics, *args, **kwargs)

Train the modeler.

This is an abstract method of this abstract class, which raises NotImplementedException.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of latent topics
  • args – arguments to be passed into the wrapped training functions
  • kwargs – arguments to be passed into the wrapped training functions
Returns:

None

Raise:

NotImplementedException

Appendix: Namespaces for Topic Modeler in Previous Versions

All generative topic modeling algorithms were placed under the package shorttext.classifiers for versions <= 0.3.4. In the current version (>= 0.3.5), however, all generative models are placed under shorttext.generators, while any classifiers making use of these topic models are still kept under shorttext.classifiers. The list includes:

shorttext.classifiers.GensimTopicModeler  ->  shorttext.generators.GensimTopicModeler
shorttext.classifiers.LDAModeler  ->  shorttext.generators.LDAModeler
shorttext.classifiers.LSIModeler  ->  shorttext.generators.LSIModeler
shorttext.classifiers.RPModeler  ->  shorttext.generators.RPModeler
shorttext.classifiers.AutoencodingTopicModeler  ->  shorttext.generators.AutoencodingTopicModeler
shorttext.classifiers.load_gensimtopicmodel  ->  shorttext.generators.load_gensimtopicmodel
shorttext.classifiers.load_autoencoder_topic  ->  shorttext.generators.load_autoencoder_topicmodel

Before release 0.5.6, for backward compatibility, developers could still call the topic models as if there were no such changes, although they were advised to make this change. Effective release 0.5.7, however, this backward compatibility is no longer available.

Classification Using Cosine Similarity

The topic modelers are trained to represent a short text in terms of a topic vector, which is effectively the feature vector. To perform supervised classification, however, a classification algorithm is needed. The first one calculates the cosine similarities between the topic vector of the given short text and those of the texts in each class label.

If there is already a trained topic modeler, whether it is shorttext.generators.GensimTopicModeler or shorttext.generators.AutoencodingTopicModeler, a classifier based on cosine similarities can be initialized immediately without training. Taking the LDA example above, such a classifier can be initialized as follows:

>>> cos_classifier = shorttext.classifiers.TopicVectorCosineDistanceClassifier(topicmodeler)

Or if the user already saved the topic modeler, one can initiate the same classifier by loading the topic modeler:

>>> cos_classifier = shorttext.classifiers.load_gensimtopicvec_cosineClassifier('/path/to/nihlda128.bin')

To perform prediction, enter:

>>> cos_classifier.score('stem cell research')

which outputs a dictionary with labels and the corresponding scores.
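The scoring itself can be sketched in plain Python. The functions cosine_similarity and score below, and the topic vectors, are made up for illustration; they show the idea of scoring by cosine similarity, not the package's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score(text_topicvec, class_topicvecs):
    """Return a dict of class label -> cosine similarity with the text."""
    return {label: cosine_similarity(text_topicvec, vec)
            for label, vec in class_topicvecs.items()}

# Made-up topic vectors, one per class label.
class_topicvecs = {"biology": [0.9, 0.1, 0.0], "physics": [0.0, 0.2, 0.8]}
scores = score([0.8, 0.2, 0.0], class_topicvecs)
best_label = max(scores, key=scores.get)
print(scores, best_label)
```

The label with the highest score is the predicted class.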

The same applies to the autoencoder, but a classifier based on an autoencoder is loaded by another function:

>>> cos_classifier = shorttext.classifiers.load_autoencoder_cosineClassifier('/path/to/sub_autoencoder8.bin')

class shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.TopicVecCosineDistanceClassifier(topicmodeler)

This class implements a classifier that performs classification based on the cosine similarity between the topic vectors of the user-input short texts and those of the various classes. The topic vectors are calculated using LatentTopicModeler.

loadmodel(nameprefix)

Load the topic model with the given prefix of the file paths.

Given the prefix of the file paths, load the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).

This is essentially loading the topic modeler LatentTopicModeler.

Parameters: nameprefix (str) – prefix of the file paths
Returns: None

savemodel(nameprefix)

Save the model with names according to the prefix.

Given the prefix of the file paths, save the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, the tf-idf model (.gensimtfidf) is saved as well.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

This is essentially saving the topic modeler LatentTopicModeler.

Parameters: nameprefix (str) – prefix of the file paths
Returns: None
Raises: ModelNotTrainedException

score(shorttext)

Calculate the score of the short text against each class label, where the score is the cosine similarity with the topic vector of the model.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Return type: dict

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_autoencoder_cosineClassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load an autoencoder from files for topic modeling, and return a cosine classifier.

Given the prefix of the file paths, load the model from files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).

Parameters:
  • name (str) – name (if compact=True) or prefix (if compact=False) of the file paths
  • preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • compact (bool) – whether model file is compact (Default: True)
Returns:

a classifier that scores the short text based on the autoencoder

Return type:

TopicVecCosineDistanceClassifier

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_gensimtopicvec_cosineClassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load a gensim topic model from files and return a cosine distance classifier.

Given the prefix of the files of the topic model, return a cosine distance classifier based on this model, i.e., TopicVecCosineDistanceClassifier.

The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).

Parameters:
  • name (str) – name (if compact=True) or prefix (if compact=False) of the file paths
  • preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • compact (bool) – whether model file is compact (Default: True)
Returns:

a classifier that scores the short text based on the topic model

Return type:

TopicVecCosineDistanceClassifier

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_autoencoder_cosineClassifier(classdict, nb_topics, preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True, *args, **kwargs)

Return a cosine distance classifier, i.e., TopicVecCosineDistanceClassifier, while training an autoencoder as a topic model in between.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of topics, i.e., number of encoding dimensions
  • preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
  • args – arguments to be passed to keras model fitting
  • kwargs – arguments to be passed to keras model fitting
Returns:

a classifier that scores the short text based on the autoencoder

Return type:

TopicVecCosineDistanceClassifier

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_gensimtopicvec_cosineClassifier(classdict, nb_topics, preprocessor=<function text_preprocessor.<locals>.<lambda>>, algorithm='lda', toweigh=True, normalize=True, *args, **kwargs)

Return a cosine distance classifier, i.e., TopicVecCosineDistanceClassifier, while training a gensim topic model in between.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of latent topics
  • preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • algorithm (str) – algorithm for topic modeling. Options: lda, lsi, rp. (Default: lda)
  • toweigh (bool) – whether to weigh the words using tf-idf. (Default: True)
  • normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
  • args – arguments to pass to the train method for gensim topic models
  • kwargs – arguments to pass to the train method for gensim topic models
Returns:

a classifier that scores the short text based on the topic model

Return type:

TopicVecCosineDistanceClassifier

Classification Using Scikit-Learn Classifiers

The topic modeler can be used to generate features for other machine learning algorithms. Any supervised learning algorithm in scikit-learn can be used here; we use Gaussian naive Bayes as an example. For faster demonstration, use the subject keywords as the example dataset.

>>> subtopicmodeler = shorttext.generators.LDAModeler()
>>> subtopicmodeler.train(subdict, 8)

We first import the class:

>>> from sklearn.naive_bayes import GaussianNB

And we train the classifier:

>>> classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(subtopicmodeler, GaussianNB())
>>> classifier.train(subdict)

Predictions can be made as in the following example:

>>> classifier.score('functional integral')

which outputs a dictionary with labels and the corresponding scores.
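
For readers who want to see what the classification stage does with topic vectors, the following is a simplified, standard-library sketch of Gaussian naive Bayes. It is not scikit-learn's GaussianNB: the variance floor is a simplification, and the two-dimensional topic vectors and labels are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior, feature means, and feature variances per class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((x - m) ** 2 for x in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_posteriors(model, vec):
    """Unnormalized log posterior of each class for one feature vector."""
    result = {}
    for label, (prior, means, variances) in model.items():
        logp = math.log(prior)
        for x, m, v in zip(vec, means, variances):
            logp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        result[label] = logp
    return result


# Hypothetical topic vectors (features) and their class labels.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = ['physics', 'physics', 'biology', 'biology']

model = fit_gaussian_nb(X, y)
posteriors = log_posteriors(model, [0.85, 0.15])
predicted = max(posteriors, key=posteriors.get)
```

In the package, the topic modeler supplies the feature vectors and the scikit-learn estimator takes the place of the two functions above.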

You can save the model by:

>>> classifier.save_compact_model('/path/to/sublda8nb.bin')

where the argument specifies the path of the compact model file, which bundles the topic model and the scikit-learn model. The classifier can be loaded by calling:

>>> classifier2 = shorttext.classifiers.load_gensim_topicvec_sklearnclassifier('/path/to/sublda8nb.bin')

The topic modeler here can also be an autoencoder: passing an autoencoder topic modeler as subtopicmodeler works in the same way. However, to load a saved classifier that uses an autoencoder model, call

>>> classifier2 = shorttext.classifiers.load_autoencoder_topic_sklearnclassifier('/path/to/filename.bin')

Compact model files saved by TopicVectorSkLearnClassifier in shorttext >= 1.0.0 cannot be read by earlier versions of shorttext. The reverse does not hold: compact model files saved by earlier versions can still be read.

class shorttext.classifiers.bow.topic.SkLearnClassification.TopicVectorSkLearnClassifier(topicmodeler, sklearn_classifier)

This is a classifier that wraps any supervised learning algorithm in scikit-learn, and uses the topic vectors output by the topic modeler LatentTopicModeler, which wraps the topic models in gensim.

# Reference

Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ’08 Proceedings of the 17th international conference on World Wide Web. (2008) [ACL]

classify(shorttext)

Give the highest-scoring class of the given short text according to the classifier.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: class label of the classification result of the given short text
Raise: ModelNotTrainedException
Return type: str
getvector(shorttext)

Retrieve the topic vector representation of the given short text.

If the topic modeler does not have a trained model, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: topic vector representation
Raise: ModelNotTrainedException
Return type: numpy.ndarray
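
The normalize option mentioned in the parameter lists can be pictured as rescaling a topic vector to unit length. The sketch below assumes L2 (Euclidean) normalization, which is an assumption rather than something stated on this page, and the helper name is hypothetical.

```python
import math


def l2_normalize(vec):
    """Rescale a vector to unit Euclidean length; leave zero vectors as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


# A hypothetical raw topic vector and its normalized form.
raw = [3.0, 4.0]
unit = l2_normalize(raw)
unit_length = math.sqrt(sum(x * x for x in unit))
```

With unit-length vectors, the dot product of two topic vectors equals their cosine similarity.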
load_compact_model(name)

Load the classification model together with the topic model from a compact file.

Parameters: name (str) – name of the compact model file
Returns: None
loadmodel(nameprefix)

Load the classification model together with the topic model.

Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
save_compact_model(name)

Save the model.

Save the topic model and the trained scikit-learn classification model in one compact model file.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: name (str) – name of the compact model file
Returns: None
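
As an illustration of what bundling several model files into one compact file can look like, here is a standard-library sketch that assumes a zip-style archive. The on-disk format actually used by the package is not specified on this page, and the helper names and file names below are hypothetical.

```python
import os
import tempfile
import zipfile


def bundle_files(archive_path, file_paths):
    """Write several model files into a single zip archive."""
    with zipfile.ZipFile(archive_path, 'w') as archive:
        for path in file_paths:
            archive.write(path, arcname=os.path.basename(path))


def unbundle_files(archive_path, target_dir):
    """Extract every member of the archive; return the sorted member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical stand-ins for a topic model file and a classifier file.
    parts = []
    for filename in ('demo_topicmodel.bin', 'demo_classifier.pkl'):
        path = os.path.join(workdir, filename)
        with open(path, 'wb') as handle:
            handle.write(b'placeholder bytes')
        parts.append(path)

    archive_path = os.path.join(workdir, 'demo_compact.bin')
    bundle_files(archive_path, parts)
    restored = unbundle_files(archive_path, os.path.join(workdir, 'restored'))
```

A single archive is easier to move around than a set of files sharing a prefix, which is the practical difference between the compact and non-compact save methods.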
savemodel(nameprefix)

Save the model.

Save the topic model and the trained scikit-learn classification model. The scikit-learn model will have the name nameprefix followed by the extension .pkl. The topic model is the same as the one in LatentTopicModeler.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raise: ModelNotTrainedException
score(shorttext)

Calculate the score, which is the cosine similarity with the topic vector of the model, of the short text against each class label.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raise: ModelNotTrainedException
Return type: dict
train(classdict, *args, **kwargs)

Train the classifier.

If the topic modeler does not have a trained model, it will raise ModelNotTrainedException.

Parameters:
  • classdict (dict) – training data
  • args – arguments to be passed to the fit method of the scikit-learn classifier
  • kwargs – arguments to be passed to the fit method of the scikit-learn classifier
Returns:

None

Raise:

ModelNotTrainedException
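
Conceptually, training turns the class dictionary into a feature matrix and a label list before fitting the scikit-learn classifier. The sketch below illustrates that step under two assumptions: that classdict maps each class label to a list of short texts, and that a function returning a topic vector is available. The toy_vectorize function is a hypothetical stand-in for a trained topic modeler.

```python
def build_training_arrays(classdict, vectorize):
    """Flatten {label: [texts]} into parallel lists of features and labels."""
    X, y = [], []
    for label in sorted(classdict):
        for text in classdict[label]:
            X.append(vectorize(text))
            y.append(label)
    return X, y


def toy_vectorize(text):
    """Hypothetical stand-in for a topic modeler: two crude 'topic' counts."""
    words = text.lower().split()
    physics_count = sum(word in {'quantum', 'integral'} for word in words)
    biology_count = sum(word in {'cell', 'gene'} for word in words)
    return [float(physics_count), float(biology_count)]


classdict = {
    'physics': ['quantum field', 'functional integral'],
    'biology': ['stem cell', 'gene expression'],
}
X, y = build_training_arrays(classdict, toy_vectorize)
```

The resulting X and y have the shape expected by the fit method of a scikit-learn estimator.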

shorttext.classifiers.bow.topic.SkLearnClassification.load_autoencoder_topic_sklearnclassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)
Load the classifier, a wrapper around a scikit-learn classifier whose feature vectors are given by an autoencoder topic model, from files.

Parameters:
  • name (str) – name (if compact==True) or prefix (if compact==False) of the paths of model files
  • preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • compact (bool) – whether model file is compact (Default: True)
Returns:

a trained classifier

Return type:

TopicVectorSkLearnClassifier

shorttext.classifiers.bow.topic.SkLearnClassification.load_gensim_topicvec_sklearnclassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)
Load the classifier, a wrapper around a scikit-learn classifier whose feature vectors are given by a topic model, from files.

Parameters:
  • name (str) – name (if compact==True) or prefix (if compact==False) of the paths of model files
  • preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • compact (bool) – whether model file is compact (Default: True)
Returns:

a trained classifier

Return type:

TopicVectorSkLearnClassifier

shorttext.classifiers.bow.topic.SkLearnClassification.train_autoencoder_topic_sklearnclassifier(classdict, nb_topics, sklearn_classifier, preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True, keras_paramdict={}, sklearn_paramdict={})

Train the supervised learning classifier, with features given by topic vectors.

It trains an autoencoder topic model and then trains a supervised learning classifier on its encoded vector representations. An instantiated (but not yet trained) scikit-learn classifier must be passed as the sklearn_classifier argument.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of topics, i.e., number of encoding dimensions
  • sklearn_classifier (sklearn.base.BaseEstimator) – instantiated scikit-learn classifier
  • preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • normalize (bool) – whether the retrieved topic vectors are normalized (Default: True)
  • keras_paramdict – arguments to be passed to keras for training the autoencoder
  • sklearn_paramdict – arguments to be passed to scikit-learn for fitting the classifier
Returns:

a trained classifier

Return type:

TopicVectorSkLearnClassifier

shorttext.classifiers.bow.topic.SkLearnClassification.train_gensim_topicvec_sklearnclassifier(classdict, nb_topics, sklearn_classifier, preprocessor=<function text_preprocessor.<locals>.<lambda>>, topicmodel_algorithm='lda', toweigh=True, normalize=True, gensim_paramdict={}, sklearn_paramdict={})

Train the supervised learning classifier, with features given by topic vectors.

It trains a topic model and then trains a supervised learning classifier on its topic vector representations. An instantiated (but not yet trained) scikit-learn classifier must be passed as the sklearn_classifier argument.

Parameters:
  • classdict (dict) – training data
  • nb_topics (int) – number of topics in the topic model
  • sklearn_classifier (sklearn.base.BaseEstimator) – instantiated scikit-learn classifier
  • preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
  • topicmodel_algorithm (str) – topic model algorithm (Default: ‘lda’)
  • toweigh (bool) – whether to weigh the words using tf-idf (Default: True)
  • normalize (bool) – whether the retrieved topic vectors are normalized (Default: True)
  • gensim_paramdict (dict) – arguments to be passed on to the train method of the gensim topic model
  • sklearn_paramdict (dict) – arguments to be passed on to the fit method of the sklearn classification algorithm
Returns:

a trained classifier

Return type:

TopicVectorSkLearnClassifier
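
The toweigh option refers to tf-idf weighting of words before topic modeling. As a rough illustration, the sketch below computes one common variant, term count times log(N / document frequency). The exact formula and normalization used by gensim may differ, and the tokenized documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(tokenized_docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weighted.append({term: count * math.log(n_docs / doc_freq[term])
                         for term, count in counts.items()})
    return weighted


# Hypothetical preprocessed and tokenized short texts.
docs = [
    ['stem', 'cell', 'research'],
    ['cell', 'biology'],
    ['quantum', 'research'],
]
weights = tfidf_weights(docs)
```

Terms that occur in fewer documents, such as 'stem' here, receive a larger weight than terms shared across documents, such as 'cell'.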

Notes about Text Preprocessing

The topic models are based on the bag-of-words model, so text preprocessing is very important. However, the text preprocessing step cannot be serialized, so users should keep track of it on their own. Unless it is necessary to do otherwise, use the standard preprocessing.

See more: Text Preprocessing.
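
One plausible reason a preprocessing step resists serialization, offered here as an assumption rather than a statement about the package internals, is that Python's pickle module cannot store lambda functions. The sketch below demonstrates this and shows one way to keep track of a custom preprocessor: define it as a named function in your own code and pass it again when loading a model. The simple_preprocessor function is hypothetical.

```python
import pickle
import re


def simple_preprocessor(text):
    """Hypothetical preprocessor: lowercase, drop non-letters, collapse spaces."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    return ' '.join(text.split())


def can_pickle(obj):
    """Report whether the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A lambda has no importable name, so pickle refuses to serialize it.
lambda_ok = can_pickle(lambda text: text.lower())

# A named function is stored by reference to its module and name only,
# so the defining code must still be available when the model is loaded.
named_ok = can_pickle(simple_preprocessor)

cleaned = simple_preprocessor('Stem-Cell Research, 2nd ed.')
```

Keeping the preprocessor in a versioned module next to the saved model files makes it possible to supply the same function through the preprocessor argument of the loading functions.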

Reference

David M. Blei, “Probabilistic Topic Models,” Communications of the ACM 55(4): 77-84 (2012).

Francois Chollet, “Building Autoencoders in Keras,” The Keras Blog. [Keras]

Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ’08 Proceedings of the 17th international conference on World Wide Web. (2008) [ACL]
