# Supervised Classification with Topics as Features

## Topic Vectors as Intermediate Feature Vectors

nltk and gensim offer good frameworks for classification using the bag-of-words (BOW) model as features. However, the BOW feature vectors of short texts can be very sparse, and the relationships between words with similar meanings are ignored. One way to tackle this is topic modeling, i.e., representing the text as a topic vector. This package provides the following ways to model the topics:

• LDA (Latent Dirichlet Allocation)
• LSI (Latent Semantic Indexing)
• RP (Random Projections)
• Autoencoder

With the topic representations, users can use any supervised learning algorithm provided by scikit-learn to perform the classification.
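To make the sparsity problem concrete, the following is a minimal sketch in plain Python (not shorttext's or gensim's implementation) that builds BOW count vectors for three short texts and measures how many entries are zero. The helper name `bow_vectors` and the whitespace tokenization are assumptions made for illustration.

```python
def bow_vectors(texts):
    """Return (vocabulary, vectors) for whitespace-tokenized texts."""
    vocabulary = sorted({token for text in texts for token in text.lower().split()})
    index = {token: i for i, token in enumerate(vocabulary)}
    vectors = []
    for text in texts:
        vec = [0] * len(vocabulary)
        for token in text.lower().split():
            vec[index[token]] += 1
        vectors.append(vec)
    return vocabulary, vectors


texts = ["stem cell research", "gene expression analysis", "cell biology"]
vocabulary, vectors = bow_vectors(texts)

# Count the zero entries: most of each vector is empty, even for a tiny corpus.
zeros = sum(vec.count(0) for vec in vectors)
sparsity = zeros / (len(vectors) * len(vocabulary))
print(len(vocabulary), zeros, round(sparsity, 2))
```

With a realistic vocabulary of thousands of words, a short text of a few tokens leaves nearly every entry at zero, which is the motivation for the lower-dimensional topic vectors.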

## Topic Models in gensim: LDA, LSI, and Random Projections

This package supports three algorithms provided by gensim, namely, LDA, LSI, and Random Projections, to do the topic modeling.

>>> import shorttext


First, load a set of training data (all NIH data in this example):

>>> trainclassdict = shorttext.data.nihreports(sample_size=None)
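The training data is passed around as a dictionary. The sketch below assumes the layout is a mapping from each class label to a list of short texts; this section does not spell the format out, so treat that as an assumption. The helper `build_classdict` is hypothetical and not part of shorttext.

```python
from collections import defaultdict


def build_classdict(labelled_texts):
    """Group (label, text) pairs into {label: [text, ...]}."""
    classdict = defaultdict(list)
    for label, text in labelled_texts:
        classdict[label].append(text)
    return dict(classdict)


pairs = [
    ("mathematics", "linear algebra"),
    ("physics", "path integral"),
    ("mathematics", "topology"),
]
classdict = build_classdict(pairs)
print(classdict)
```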


Initialize an instance of the topic modeler, using LDA as an example:

>>> topicmodeler = shorttext.generators.LDAModeler()


For other algorithms, use LSIModeler for LSI or RPModeler for RP; everything else is the same. To train with 128 topics, enter:

>>> topicmodeler.train(trainclassdict, 128)


After the training is done, the user can retrieve the topic vector representation with the trained model. For example,

>>> topicmodeler.retrieve_topicvec('stem cell research')

>>> topicmodeler.retrieve_topicvec('bioinformatics')


By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:

>>> topicmodeler['stem cell research']

>>> topicmodeler['bioinformatics']


The same preprocessing is applied in both the training and the retrieval above. Users can provide their own preprocessor when initializing the topic modeler.
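A custom preprocessor could look like the following sketch. It assumes the preprocessor is a plain callable that takes a string and returns a cleaned string; the function name `simple_preprocessor` is hypothetical, and the commented usage line relies only on the `preprocessor` keyword shown in the class signatures.

```python
import re


def simple_preprocessor(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


print(simple_preprocessor("Stem-Cell  Research!"))

# Hypothetical usage, based on the `preprocessor` keyword in the class signatures:
# topicmodeler = shorttext.generators.LDAModeler(preprocessor=simple_preprocessor)
```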

Users can save the trained model by calling:

>>> topicmodeler.save_compact_model('/path/to/nihlda128.bin')


And the topic model can be retrieved by calling:

>>> topicmodeler2 = shorttext.generators.load_gensimtopicmodel('/path/to/nihlda128.bin')


When initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency - inverse document frequency). The default is to weigh. To not weigh, initialize it as:

>>> topicmodeler3 = shorttext.generators.GensimTopicModeler(toweigh=False)
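For readers unfamiliar with tf-idf, here is a minimal sketch of one common variant: the raw term count multiplied by the base-2 logarithm of the inverse document frequency. It is an illustration only; the exact weighting and normalization applied by gensim may differ, and the function name `tfidf` is an assumption.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {token: weight} dict per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weighted.append(
            {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted


docs = [["stem", "cell", "research"], ["cell", "biology"], ["gene", "research"]]
weights = tfidf(docs)

# "stem" occurs in one document, "cell" in two, so "stem" gets the larger weight.
print(weights[0])
```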


### Appendix: Model I/O in Previous Versions

In previous versions of shorttext, trained models were saved by calling:

>>> topicmodeler.savemodel('/path/to/nihlda128')


However, this method is now discouraged: model I/O differs among the various gensim models, and this call can produce errors.

All of the saved files have to be present for the model to be loaded. Note that the preprocessor is not saved. To load the model, enter:

>>> topicmodeler2 = shorttext.classifiers.load_gensimtopicmodel('/path/to/nihlda128', compact=False)

class shorttext.generators.bow.GensimTopicModeling.GensimTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, algorithm='lda', toweigh=True, normalize=True)

This class facilitates the creation of topic models (options: LDA (latent Dirichlet Allocation), LSI (latent semantic indexing), and Random Projections) with the given short text training data, and converts future short text into topic vectors using the trained topic model.

No compact model I/O available for this class. Refer to LDAModeler and LSIModeler.

This class extends LatentTopicModeler.

get_batch_cos_similarities(shorttext)

Calculate the score of the short text against each class label. The score is the cosine similarity with the topic vector of the model.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raises: ModelNotTrainedException
Return type: dict

loadmodel(nameprefix)

Load the topic model with the given prefix of the file paths.

Given the prefix of the file paths, load the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).

Parameters: nameprefix (str) – prefix of the file paths
Returns: None

retrieve_corpus_topicdist(shorttext)

Calculate the topic vector representation of the short text, in the corpus form.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – text to be represented
Returns: topic vector in the corpus form
Raises: ModelNotTrainedException
Return type: list

retrieve_topicvec(shorttext)

Calculate the topic vector representation of the short text.

This function calls retrieve_corpus_topicdist().

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – text to be represented
Returns: topic vector
Raises: ModelNotTrainedException
Return type: numpy.ndarray

savemodel(nameprefix)

Save the model with names according to the prefix.

Given the prefix of the file paths, save the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, the tf-idf model (.gensimtfidf) is saved as well.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: nameprefix (str) – prefix of the file paths
Returns: None
Raises: ModelNotTrainedException

train(classdict, nb_topics, *args, **kwargs)

Train the topic modeler.

Parameters:
• classdict (dict) – training data
• nb_topics (int) – number of latent topics
• args – arguments to pass to the train method for gensim topic models
• kwargs – arguments to pass to the train method for gensim topic models
Returns: None

update(additional_classdict)

Update the model with additional data.

It updates the topic model with additional data.

Warning: It does not allow adding class labels or new words. The dictionary is not changed, so such an update alters the topic model only. It affects the topic vector representation. While the corpus is changed, the words used in calculating the similarity matrix are not changed.

Therefore, this function is meant for a fast update. For a comprehensive model, retraining is recommended.
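The effect of an unchanged dictionary can be pictured with the following plain-Python sketch. It is not the library's code: `to_bow` and the hand-written `dictionary` are hypothetical stand-ins for a fixed word-to-id mapping, and they show that words outside that mapping are silently dropped.

```python
def to_bow(tokens, dictionary):
    """Map tokens to (word_id, count) pairs; tokens not in the dictionary are dropped."""
    counts = {}
    for token in tokens:
        if token in dictionary:
            word_id = dictionary[token]
            counts[word_id] = counts.get(word_id, 0) + 1
    return sorted(counts.items())


# Word ids fixed at the first training; a later update does not extend them.
dictionary = {"stem": 0, "cell": 1, "research": 2}

# "crispr" and "editing" are new words, so they contribute nothing.
bow = to_bow("crispr stem cell editing".split(), dictionary)
print(bow)
```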

Parameters: additional_classdict (dict) – additional training data
Returns: None

class shorttext.generators.bow.GensimTopicModeling.LDAModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)

This class facilitates the creation of LDA (latent Dirichlet Allocation) topic models with the given short text training data, and converts future short text into topic vectors using the trained topic model.

This class extends GensimTopicModeler.

class shorttext.generators.bow.GensimTopicModeling.LSIModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)

This class facilitates the creation of LSI (latent semantic indexing) topic models with the given short text training data, and converts future short text into topic vectors using the trained topic model.

This class extends GensimTopicModeler.

class shorttext.generators.bow.GensimTopicModeling.RPModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, toweigh=True, normalize=True)

This class facilitates the creation of RP (random projection) topic models with the given short text training data, and converts future short text into topic vectors using the trained topic model.

This class extends GensimTopicModeler.

shorttext.generators.bow.GensimTopicModeling.load_gensimtopicmodel(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load the gensim topic modeler from files.

Parameters:
• name (str) – name (if compact=True) or prefix (if compact=False) of the file path
• preprocessor (function) – function that preprocesses the text. (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
• compact (bool) – whether model file is compact (Default: True)
Returns: a topic modeler
Return type: GensimTopicModeler

## AutoEncoder

Note: Previous versions (<=0.2.1) of this autoencoder have a serious bug. The current version is incompatible with the autoencoder of versions <=0.2.1.

Another way to find a new topic vector representation is to use an autoencoder, a neural network model that compresses a vector representation into another, usually shorter, one (rarely a longer one) by minimizing the difference between the input layer and the decoding layer. For faster demonstration, use the subject keywords as the example dataset:

>>> subdict = shorttext.data.subjectkeywords()


To train such a model, proceed in the same way as with the LDA model (or LSI and random projections) above:

>>> autoencoder = shorttext.generators.AutoencodingTopicModeler()
>>> autoencoder.train(subdict, 8)
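As a rough illustration of the idea behind such training, the sketch below fits a tiny linear autoencoder in plain Python by stochastic gradient descent on the squared reconstruction error. It is not the keras model used by the package: the linear layers, the learning rate, the toy data, and the names `matvec` and `train_linear_autoencoder` are all assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def train_linear_autoencoder(data, n_hidden, epochs=300, lr=0.02, seed=0):
    """Fit encoder/decoder matrices that minimize the squared reconstruction error."""
    rng = random.Random(seed)
    n_in = len(data[0])
    encoder = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    decoder = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            hidden = matvec(encoder, x)        # compressed representation
            output = matvec(decoder, hidden)   # reconstruction of the input
            error = [o - xi for o, xi in zip(output, x)]
            total += sum(e * e for e in error)
            # Gradient of the squared error with respect to the hidden units,
            # computed before the decoder weights are changed.
            grad_hidden = [
                2 * sum(error[i] * decoder[i][j] for i in range(n_in))
                for j in range(n_hidden)
            ]
            for i in range(n_in):
                for j in range(n_hidden):
                    decoder[i][j] -= lr * 2 * error[i] * hidden[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    encoder[j][k] -= lr * grad_hidden[j] * x[k]
        losses.append(total / len(data))
    return encoder, losses


# Four 4-dimensional BOW-like vectors that really only vary in two directions.
data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
encoder, losses = train_linear_autoencoder(data, n_hidden=2)
encoded = matvec(encoder, data[0])   # 2-dimensional "topic vector"
print(len(encoded), losses[0] > losses[-1])
```

The encoded vector plays the role of the topic vector: it is shorter than the input, yet it retains enough information for the decoder to approximately reconstruct the original.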


After the training is done, the user can retrieve the encoded vector representation with the trained autoencoder model. For example,

>>> autoencoder.retrieve_topicvec('linear algebra')

>>> autoencoder.retrieve_topicvec('path integral')


By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:

>>> autoencoder['linear algebra']

>>> autoencoder['path integral']


The same preprocessing is applied in both the training and the retrieval above. Users can provide their own preprocessor when initializing the topic modeler.

Users can save the trained models, by calling:

>>> autoencoder.save_compact_model('/path/to/sub_autoencoder8.bin')


And the model can be retrieved by calling:

>>> autoencoder2 = shorttext.generators.load_autoencoder_topicmodel('/path/to/sub_autoencoder8.bin')


As with the other topic models, when initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency - inverse document frequency). The default is to weigh. To not weigh, initialize it as:

>>> autoencoder3 = shorttext.generators.AutoencodingTopicModeler(toweigh=False)

class shorttext.generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True)

This class facilitates the topic modeling of input training data using the autoencoder.

For a reference on how an autoencoder is written with keras, see Building Autoencoders in Keras by Francois Chollet.

This class extends LatentTopicModeler.

get_batch_cos_similarities(shorttext)

Calculate the score of the short text against each class label. The score is the cosine similarity with the topic vector of the model.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Raises: ModelNotTrainedException
Return type: dict

loadmodel(nameprefix, load_incomplete=False)

Load the model with names according to the prefix.

Given the prefix of the file paths, load the model from files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).

Parameters:
• nameprefix (str) – prefix of the paths of the file
• load_incomplete (bool) – load encoder only, not decoder and autoencoder file (Default: False; put True for model built in version <= 0.2.1)
Returns: None

precalculate_liststr_topicvec(shorttexts)

Calculate the summed topic vectors for training data for each class.

This function is called while training.

Parameters: shorttexts (list) – list of short texts
Returns: average topic vector
Raises: ModelNotTrainedException
Return type: numpy.ndarray

retrieve_topicvec(shorttext)

Calculate the topic vector representation of the short text.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text
Returns: encoded vector representation of the short text
Raises: ModelNotTrainedException
Return type: numpy.ndarray

savemodel(nameprefix, save_complete_autoencoder=True)

Save the model with names according to the prefix.

Given the prefix of the file paths, save the model into files, with name given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).

If save_complete_autoencoder is True, then there are also files with names ending with “_decoder.json” and “_decoder.h5”.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters:
• nameprefix (str) – prefix of the paths of the file
• save_complete_autoencoder (bool) – whether to store the decoder and the complete autoencoder (Default: True; but False for version <= 0.2.1)
Returns: None

train(classdict, nb_topics, *args, **kwargs)

Train the autoencoder.

Parameters:
• classdict (dict) – training data
• nb_topics (int) – number of topics, i.e., the number of encoding dimensions
• args – arguments to be passed to keras model fitting
• kwargs – arguments to be passed to keras model fitting
Returns: None

shorttext.generators.bow.AutoEncodingTopicModeling.load_autoencoder_topicmodel(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load the autoencoding topic model from files.

Parameters:
• name (str) – name (if compact=True) or prefix (if compact=False) of the paths of the model files
• preprocessor (function) – function that preprocesses the text. (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
• compact (bool) – whether model file is compact (Default: True)
Returns: an autoencoder as a topic modeler
Return type: generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler

### Appendix: Unzipping Model I/O

In previous versions of shorttext, trained models were saved by calling:

>>> autoencoder.savemodel('/path/to/sub_autoencoder8')


The following files are produced for the autoencoder:

/path/to/sub_autoencoder8.json
/path/to/sub_autoencoder8.gensimdict
/path/to/sub_autoencoder8_encoder.json
/path/to/sub_autoencoder8_encoder.h5
/path/to/sub_autoencoder8_classtopicvecs.pkl


If save_complete_autoencoder=True is specified, four more files are produced:

/path/to/sub_autoencoder8_decoder.json
/path/to/sub_autoencoder8_decoder.h5
/path/to/sub_autoencoder8_autoencoder.json
/path/to/sub_autoencoder8_autoencoder.h5


Users can load the same model later by entering:

>>> autoencoder2 = shorttext.classifiers.load_autoencoder_topic('/path/to/sub_autoencoder8', compact=False)


## Abstract Latent Topic Modeling Class

Both shorttext.generators.GensimTopicModeler and shorttext.generators.AutoencodingTopicModeler extend shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler, which is effectively an abstract class. Users who want to develop their own topic model by extending it have to define the methods train, retrieve_topicvec, loadmodel, and savemodel.
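The pattern can be sketched without importing shorttext at all. In the example below, `SketchTopicModeler` is a hypothetical stand-in for the abstract base (it is not the library class and uses Python's built-in NotImplementedError), and `LengthBucketModeler` is a deliberately trivial subclass that defines the four required methods.

```python
import json


class SketchTopicModeler:
    """Stand-in for an abstract topic modeler; NOT the shorttext class."""

    def train(self, classdict, nb_topics, *args, **kwargs):
        raise NotImplementedError

    def retrieve_topicvec(self, shorttext):
        raise NotImplementedError

    def loadmodel(self, nameprefix):
        raise NotImplementedError

    def savemodel(self, nameprefix):
        raise NotImplementedError

    def __getitem__(self, shorttext):
        # Allows modeler['some text'] as a shortcut for retrieve_topicvec.
        return self.retrieve_topicvec(shorttext)


class LengthBucketModeler(SketchTopicModeler):
    """Toy subclass: the 'topic vector' is a one-hot bucket of the token count."""

    def train(self, classdict, nb_topics, *args, **kwargs):
        self.nb_topics = nb_topics

    def retrieve_topicvec(self, shorttext):
        n_tokens = len(shorttext.split())
        bucket = min(max(n_tokens, 1), self.nb_topics) - 1
        vec = [0.0] * self.nb_topics
        vec[bucket] = 1.0
        return vec

    def savemodel(self, nameprefix):
        with open(nameprefix + ".json", "w") as f:
            json.dump({"nb_topics": self.nb_topics}, f)

    def loadmodel(self, nameprefix):
        with open(nameprefix + ".json") as f:
            self.nb_topics = json.load(f)["nb_topics"]


modeler = LengthBucketModeler()
modeler.train({"mathematics": ["linear algebra"]}, 3)
vec = modeler["linear algebra"]   # two tokens fall into the second bucket
print(vec)
```

A real subclass would replace the toy logic in train and retrieve_topicvec with an actual model, while keeping the same four entry points so that the classifiers can use it interchangeably.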

class shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler(preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True)

Abstract class for various topic modelers.

generate_corpus(classdict)

Calculate the gensim dictionary and corpus, and extract the class labels from the training data. Called by train().

Parameters: classdict (dict) – training data
Returns: None

get_batch_cos_similarities(shorttext)

Calculate the cosine similarities of the given short text and all the class labels.

This is an abstract method of this abstract class, which raises the NotImplementedException.

Parameters: shorttext (str) – short text
Returns: topic vector
Raises: NotImplementedException
Return type: numpy.ndarray

loadmodel(nameprefix)

Load the model from files.

This is an abstract method of this abstract class, which raises the NotImplementedException.

Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raises: NotImplementedException

retrieve_bow(shorttext)

Calculate the gensim bag-of-words representation of the given short text.

Parameters: shorttext (str) – text to be represented
Returns: corpus representation of the text
Return type: list

retrieve_bow_vector(shorttext, normalize=True)

Calculate the vector representation of the bag-of-words in terms of numpy.ndarray.

Parameters:
• shorttext (str) – short text
• normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
Returns: vector representation of the text
Return type: numpy.ndarray

retrieve_topicvec(shorttext)

Calculate the topic vector representation of the short text.

This is an abstract method of this abstract class, which raises the NotImplementedException.

Parameters: shorttext (str) – short text
Returns: topic vector
Raises: NotImplementedException
Return type: numpy.ndarray

savemodel(nameprefix)

Save the model to files.

This is an abstract method of this abstract class, which raises the NotImplementedException.

Parameters: nameprefix (str) – prefix of the paths of the model files
Returns: None
Raises: NotImplementedException

train(classdict, nb_topics, *args, **kwargs)

Train the modeler.

This is an abstract method of this abstract class, which raises the NotImplementedException.

Parameters:
• classdict (dict) – training data
• nb_topics (int) – number of latent topics
• args – arguments to be passed into the wrapped training functions
• kwargs – arguments to be passed into the wrapped training functions
Returns: None
Raises: NotImplementedException

### Appendix: Namespaces for Topic Modeler in Previous Versions

All generative topic modeling algorithms were placed under the package shorttext.classifiers for versions <= 0.3.4. In the current version (>= 0.3.5), however, all generative models are placed under shorttext.generators, while any classifiers making use of these topic models are still kept under shorttext.classifiers. The affected classes are:

shorttext.classifiers.GensimTopicModeler  ->  shorttext.generators.GensimTopicModeler
shorttext.classifiers.LDAModeler  ->  shorttext.generators.LDAModeler
shorttext.classifiers.LSIModeler  ->  shorttext.generators.LSIModeler
shorttext.classifiers.RPModeler  ->  shorttext.generators.RPModeler
shorttext.classifiers.AutoencodingTopicModeler  ->  shorttext.generators.AutoencodingTopicModeler


Before release 0.5.6, for backward compatibility, developers can still call the topic models as if there were no such changes, although they are advised to make this change. However, effective release 0.5.7, this backward compatibility is no longer available.

## Classification Using Cosine Similarity

The topic modelers are trained to represent the short text in terms of a topic vector, effectively the feature vector. However, performing supervised classification also requires a classification algorithm. The first one is to calculate the cosine similarities between the topic vector of the given short text and those of the texts in all class labels.

If there is already a trained topic modeler, whether it is shorttext.generators.GensimTopicModeler or shorttext.generators.AutoencodingTopicModeler, a classifier based on cosine similarities can be initiated immediately without training. Taking the LDA example above, such a classifier can be initiated as follows:

>>> cos_classifier = shorttext.classifiers.TopicVecCosineDistanceClassifier(topicmodeler)


Or if the user already saved the topic modeler, one can initiate the same classifier by loading the topic modeler:

>>> cos_classifier = shorttext.classifiers.load_gensimtopicvec_cosineClassifier('/path/to/nihlda128.bin')


To perform prediction, enter:

>>> cos_classifier.score('stem cell research')


which outputs a dictionary with labels and the corresponding scores.
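The scoring step can be sketched in plain Python as follows. This is an illustration of cosine-similarity scoring, not the library's code: the per-class topic vectors in `class_vecs` are made-up numbers, and the function names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score(text_vec, class_vecs):
    """Return {label: cosine similarity between the text and that class}."""
    return {label: cosine_similarity(text_vec, vec) for label, vec in class_vecs.items()}


# Hypothetical topic vectors for two classes and for one input text.
class_vecs = {"mathematics": [0.9, 0.1, 0.0], "physics": [0.1, 0.8, 0.1]}
scores = score([0.8, 0.2, 0.0], class_vecs)
best = max(scores, key=scores.get)
print(scores, best)
```

Because only the angle between vectors matters, no classifier needs to be trained beyond the topic model itself, which is why this classifier can be created directly from a trained topic modeler.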

The same applies to the autoencoder, but a classifier based on an autoencoder is loaded by another function:

>>> cos_classifier = shorttext.classifiers.load_autoencoder_cosineClassifier('/path/to/sub_autoencoder8.bin')

class shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.TopicVecCosineDistanceClassifier(topicmodeler)

This class implements a classifier that performs classification based on the cosine similarity between the topic vectors of the user-input short texts and various classes. The topic vectors are calculated using LatentTopicModeler.

loadmodel(nameprefix)

Load the topic model with the given prefix of the file paths.

Given the prefix of the file paths, load the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).

This is essentially loading the topic modeler LatentTopicModeler.

Parameters: nameprefix (str) – prefix of the file paths
Returns: None

savemodel(nameprefix)

Save the model with names according to the prefix.

Given the prefix of the file paths, save the corresponding topic model. The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, the tf-idf model (.gensimtfidf) is saved as well.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

This is essentially saving the topic modeler LatentTopicModeler.

Parameters: nameprefix (str) – prefix of the file paths
Returns: None
Raises: ModelNotTrainedException

score(shorttext)

Calculate the score of the short text against each class label. The score is the cosine similarity with the topic vector of the model.

Parameters: shorttext (str) – short text
Returns: dictionary of scores of the text to all classes
Return type: dict

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_autoencoder_cosineClassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load an autoencoder from files for topic modeling, and return a cosine classifier.

Given the prefix of the file paths, load the model from files, with names given by the prefix. There are files with names ending with “_encoder.json” and “_encoder.h5”, which are the JSON and HDF5 files for the encoder respectively. They also include a gensim dictionary (.gensimdict).

Parameters:
• name (str) – name (if compact=True) or prefix (if compact=False) of the file paths
• preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
• compact (bool) – whether model file is compact (Default: True)
Returns: a classifier that scores the short text based on the autoencoder
Return type: TopicVecCosineDistanceClassifier

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_gensimtopicvec_cosineClassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load a gensim topic model from files and return a cosine distance classifier.

Given the prefix of the files of the topic model, return a cosine distance classifier based on this model, i.e., TopicVecCosineDistanceClassifier.

The files include a JSON (.json) file that specifies various parameters, a gensim dictionary (.gensimdict), and a topic model (.gensimmodel). If weighing is applied, load also the tf-idf model (.gensimtfidf).

Parameters:
• name (str) – name (if compact=True) or prefix (if compact=False) of the file paths
• preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
• compact (bool) – whether model file is compact (Default: True)
Returns: a classifier that scores the short text based on the topic model
Return type: TopicVecCosineDistanceClassifier

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_autoencoder_cosineClassifier(classdict, nb_topics, preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True, *args, **kwargs)

Return a cosine distance classifier, i.e., TopicVecCosineDistanceClassifier, while training an autoencoder as a topic model in between.

Parameters:
• classdict (dict) – training data
• nb_topics (int) – number of topics, i.e., number of encoding dimensions
• preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
• normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
• args – arguments to be passed to keras model fitting
• kwargs – arguments to be passed to keras model fitting
Returns: a classifier that scores the short text based on the autoencoder
Return type: TopicVecCosineDistanceClassifier

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_gensimtopicvec_cosineClassifier(classdict, nb_topics, preprocessor=<function text_preprocessor.<locals>.<lambda>>, algorithm='lda', toweigh=True, normalize=True, *args, **kwargs)

Return a cosine distance classifier, i.e., TopicVecCosineDistanceClassifier, while training a gensim topic model in between.

Parameters:
• classdict (dict) – training data
• nb_topics (int) – number of latent topics
• preprocessor (function) – function that preprocesses the text. (Default: utils.textpreprocess.standard_text_preprocessor_1)
• algorithm (str) – algorithm for topic modeling. Options: lda, lsi, rp. (Default: lda)
• toweigh (bool) – whether to weigh the words using tf-idf. (Default: True)
• normalize (bool) – whether the retrieved topic vectors are normalized. (Default: True)
• args – arguments to pass to the train method for gensim topic models
• kwargs – arguments to pass to the train method for gensim topic models
Returns: a classifier that scores the short text based on the topic model
Return type: TopicVecCosineDistanceClassifier

## Classification Using Scikit-Learn Classifiers

The topic modeler can be used to generate features used for other machine learning algorithms. We can take any supervised learning algorithms in scikit-learn here. We use Gaussian naive Bayes as an example. For faster demonstration, use the subject keywords as the example dataset.

>>> subtopicmodeler = shorttext.generators.LDAModeler()
>>> subtopicmodeler.train(subdict, 8)


We first import the class:

>>> from sklearn.naive_bayes import GaussianNB


And we train the classifier:

>>> classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(subtopicmodeler, GaussianNB())
>>> classifier.train(subdict)


Predictions can be performed like the following example:

>>> classifier.score('functional integral')


which outputs a dictionary with labels and the corresponding scores.
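To show what a Gaussian naive Bayes classifier does with topic vectors as features, here is a small pure-Python sketch. It is not scikit-learn's implementation: the variance floor `var_floor`, the function names, and the two-dimensional toy topic vectors are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(vectors, labels, var_floor=1e-3):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The floor keeps variances away from zero on tiny datasets.
        variances = [
            sum((x - m) ** 2 for x in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(vectors)), means, variances)
    return model


def score_gaussian_nb(model, vec):
    """Return {label: posterior probability} for one feature vector."""
    log_post = {}
    for label, (log_prior, means, variances) in model.items():
        total = log_prior
        for x, m, v in zip(vec, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_post[label] = total
    # Normalize in a numerically stable way.
    top = max(log_post.values())
    unnorm = {label: math.exp(lp - top) for label, lp in log_post.items()}
    norm = sum(unnorm.values())
    return {label: value / norm for label, value in unnorm.items()}


# Hypothetical 2-dimensional topic vectors for two classes.
train_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["mathematics", "mathematics", "physics", "physics"]
model = fit_gaussian_nb(train_vecs, train_labels)
scores = score_gaussian_nb(model, [0.85, 0.15])
best = max(scores, key=scores.get)
print(best)
```

In the package itself the same division of labor applies: the topic modeler turns text into vectors, and the scikit-learn estimator only ever sees those vectors.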

You can save the model by:

>>> classifier.save_compact_model('/path/to/sublda8nb.bin')


where the argument specifies the prefix of the path of the model files, including the topic models, and the scikit-learn model files. The classifier can be loaded by calling:

>>> classifier2 = shorttext.classifiers.load_gensim_topicvec_sklearnclassifier('/path/to/sublda8nb.bin')


The topic modeler here can also be an autoencoder: passing the autoencoder in place of subtopicmodeler works the same way. However, to load a saved classifier with an autoencoder model, use:

>>> classifier2 = shorttext.classifiers.load_autoencoder_topic_sklearnclassifier('/path/to/filename.bin')


Compact model files saved by TopicVectorSkLearnClassifier in shorttext >= 1.0.0 cannot be read by earlier versions of shorttext. The reverse does not hold: old compact model files can still be read by newer versions.

class shorttext.classifiers.bow.topic.SkLearnClassification.TopicVectorSkLearnClassifier(topicmodeler, sklearn_classifier)

This is a classifier that wraps any supervised learning algorithm in scikit-learn and uses the topic vectors output by the topic modeler LatentTopicModeler, which wraps the topic models in gensim.
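The idea behind such a wrapper can be sketched with scikit-learn alone: topic vectors serve as the feature matrix, and the class labels are mapped to integer targets. This is a conceptual sketch, not the package's implementation. The two-dimensional "topic vectors" below are invented stand-ins for what a trained topic modeler would produce.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy topic vectors per class label (made up; a real topic modeler
# would compute these from the short texts).
train_vectors = {
    'physics': [[0.9, 0.1], [0.8, 0.2]],
    'theology': [[0.1, 0.9], [0.2, 0.8]],
}

# Fix an ordering of the labels so that each label maps to an integer index.
labels = sorted(train_vectors)
X = np.array([vec for label in labels for vec in train_vectors[label]])
y = np.array([idx for idx, label in enumerate(labels)
              for _ in train_vectors[label]])

# Fit the scikit-learn classifier on the topic vectors.
clf = GaussianNB()
clf.fit(X, y)

# Classify the topic vector of a new (hypothetical) short text.
query = np.array([[0.85, 0.15]])
predicted_label = labels[clf.predict(query)[0]]
probabilities = dict(zip(labels, clf.predict_proba(query)[0]))

print(predicted_label)
print(probabilities)
```

Any other scikit-learn classifier with fit and predict methods could be substituted for GaussianNB in this sketch.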

# Reference

Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ’08: Proceedings of the 17th International Conference on World Wide Web (2008). [ACL]

classify(shorttext)

Give the highest-scoring class of the given short text according to the classifier.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text

Returns: class label of the classification result of the given short text

Raises: ModelNotTrainedException

Return type: str

getvector(shorttext)

Retrieve the topic vector representation of the given short text.

If the topic modeler does not have a trained model, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text

Returns: topic vector representation

Raises: ModelNotTrainedException

Return type: numpy.ndarray

load_compact_model(name)

Load the classification model together with the topic model from a compact file.

Parameters: name (str) – name of the compact model file

Returns: None

loadmodel(nameprefix)

Load the classification model together with the topic model.

Parameters: nameprefix (str) – prefix of the paths of the model files

Returns: None

save_compact_model(name)

Save the model.

Save the topic model and the trained scikit-learn classification model in one compact model file.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: name (str) – name of the compact model file

Returns: None

savemodel(nameprefix)

Save the model.

Save the topic model and the trained scikit-learn classification model. The scikit-learn model will have the name nameprefix followed by the extension .pkl. The topic model is the same as the one in LatentTopicModeler.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: nameprefix (str) – prefix of the paths of the model files

Returns: None

Raises: ModelNotTrainedException

score(shorttext)

Calculate the score, which is the cosine similarity with the topic vector of the model, of the short text against each class label.

If neither train() nor loadmodel() was run, or if the topic model was not trained, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – short text

Returns: dictionary of scores of the text to all classes

Raises: ModelNotTrainedException

Return type: dict

train(classdict, *args, **kwargs)

Train the classifier.

If the topic modeler does not have a trained model, it will raise ModelNotTrainedException.

Parameters:

• classdict (dict) – training data
• args – arguments to be passed to the fit method of the scikit-learn classifier
• kwargs – arguments to be passed to the fit method of the scikit-learn classifier

Returns: None

Raises: ModelNotTrainedException

shorttext.classifiers.bow.topic.SkLearnClassification.load_autoencoder_topic_sklearnclassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load the classifier, a wrapper around a scikit-learn classifier with feature vectors given by an autoencoder topic model, from files.

Parameters:

• name (str) – name (if compact==True) or prefix (if compact==False) of the paths of model files
• preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
• compact (bool) – whether model file is compact (Default: True)

Returns: a trained classifier

Return type: TopicVectorSkLearnClassifier

shorttext.classifiers.bow.topic.SkLearnClassification.load_gensim_topicvec_sklearnclassifier(name, preprocessor=<function text_preprocessor.<locals>.<lambda>>, compact=True)

Load the classifier, a wrapper around a scikit-learn classifier with feature vectors given by a topic model, from files.

Parameters:

• name (str) – name (if compact==True) or prefix (if compact==False) of the paths of model files
• preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
• compact (bool) – whether model file is compact (Default: True)

Returns: a trained classifier

Return type: TopicVectorSkLearnClassifier

shorttext.classifiers.bow.topic.SkLearnClassification.train_autoencoder_topic_sklearnclassifier(classdict, nb_topics, sklearn_classifier, preprocessor=<function text_preprocessor.<locals>.<lambda>>, normalize=True, keras_paramdict={}, sklearn_paramdict={})

Train the supervised learning classifier, with features given by topic vectors.

It trains an autoencoder topic model and then, using the encoded vector representations, trains a supervised learning classifier. An instantiated (but not yet trained) scikit-learn classifier must be passed as the sklearn_classifier argument.

Parameters:

• classdict (dict) – training data
• nb_topics (int) – number of topics, i.e., number of encoding dimensions
• sklearn_classifier (sklearn.base.BaseEstimator) – instantiated scikit-learn classifier
• preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
• normalize (bool) – whether the retrieved topic vectors are normalized (Default: True)
• keras_paramdict – arguments to be passed to keras for training the autoencoder
• sklearn_paramdict – arguments to be passed to scikit-learn for fitting the classifier

Returns: a trained classifier

Return type: TopicVectorSkLearnClassifier

shorttext.classifiers.bow.topic.SkLearnClassification.train_gensim_topicvec_sklearnclassifier(classdict, nb_topics, sklearn_classifier, preprocessor=<function text_preprocessor.<locals>.<lambda>>, topicmodel_algorithm='lda', toweigh=True, normalize=True, gensim_paramdict={}, sklearn_paramdict={})

Train the supervised learning classifier, with features given by topic vectors.

It trains a topic model and then, using the topic vector representations, trains a supervised learning classifier. An instantiated (but not yet trained) scikit-learn classifier must be passed as the sklearn_classifier argument.

Parameters:

• classdict (dict) – training data
• nb_topics (int) – number of topics in the topic model
• sklearn_classifier (sklearn.base.BaseEstimator) – instantiated scikit-learn classifier
• preprocessor (function) – function that preprocesses the text (Default: utils.textpreprocess.standard_text_preprocessor_1)
• topicmodel_algorithm (str) – topic model algorithm (Default: ‘lda’)
• toweigh (bool) – whether to weigh the words using tf-idf (Default: True)
• normalize (bool) – whether the retrieved topic vectors are normalized (Default: True)
• gensim_paramdict (dict) – arguments to be passed on to the train method of the gensim topic model
• sklearn_paramdict (dict) – arguments to be passed on to the fit method of the sklearn classification algorithm

Returns: a trained classifier

Return type: TopicVectorSkLearnClassifier
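
The normalize option controls whether topic vectors are normalized before they are handed to the classifier. The text does not say which norm is used; the sketch below assumes unit Euclidean (L2) length, and the helper l2_normalize is a hypothetical name, not part of the package.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # An all-zero vector cannot be scaled; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]

raw = [3.0, 4.0, 0.0]      # made-up, unnormalized topic weights
unit = l2_normalize(raw)
print(unit)  # [0.6, 0.8, 0.0]
```

Normalizing puts short texts of different lengths on the same scale, so the classifier sees the relative mix of topics rather than the absolute weights.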

## Notes about Text Preprocessing

The topic models are based on the bag-of-words model, so text preprocessing is very important. However, the text preprocessing step cannot be serialized, so users should keep track of it on their own. Unless a custom step is necessary, use the standard preprocessing.

See more: Text Preprocessing.
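
Because the preprocessing step is not stored with the model, one way to keep track of a custom step is to define it as a named function in an importable module and supply the same function both when training and when loading, through the preprocessor parameter. The function below is a hypothetical example of such a step; it is not the package's standard preprocessor.

```python
import re

def my_preprocessor(text):
    """Hypothetical preprocessor: lowercase, keep letters only, collapse spaces."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Collapse runs of whitespace and trim the ends.
    return ' '.join(text.split())

print(my_preprocessor('Stem-Cell Research, 2017!'))  # stem cell research
```

Keeping the function under version control alongside the saved model files makes it easier to reproduce the same features later.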

## Reference

David M. Blei, “Probabilistic Topic Models,” Communications of the ACM 55(4): 77-84 (2012).

Francois Chollet, “Building Autoencoders in Keras,” The Keras Blog. [Keras]

Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ’08: Proceedings of the 17th International Conference on World Wide Web (2008). [ACL]