Supervised Classification with Topics as Features

Topic Vectors as Intermediate Feature Vectors

To perform classification using bag-of-words (BOW) features, nltk and gensim offer good frameworks. However, the BOW feature vectors of short texts can be very sparse, and the relationships between words with similar meanings are ignored. One way to tackle this is topic modeling, i.e., representing the text as a topic vector. This package provides the following ways to model topics:

LDA (Latent Dirichlet Allocation)
LSI (Latent Semantic Indexing)
RP (Random Projections)
Autoencoder

With the topic representations, users can use any supervised learning algorithm provided by scikit-learn to perform the classification.
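
To make the sparsity problem concrete, the following minimal sketch uses plain Python only, independent of shorttext, nltk, and gensim. It builds BOW count vectors for three invented short texts. The helper `bow_vector` and the sample texts are assumptions made for illustration and are not part of any package.

```python
# Toy illustration of why BOW vectors of short texts are sparse.
# The texts and the helper below are made up for this sketch.

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

texts = [
    "doctor treats patient",
    "physician cures illness",
    "stock market falls",
]

# The vocabulary is every distinct token in the toy corpus (9 words).
vocabulary = sorted({token for text in texts for token in text.lower().split()})
vectors = [bow_vector(text, vocabulary) for text in texts]

# Fraction of zero entries in each vector.
sparsity = [vector.count(0) / len(vector) for vector in vectors]

# Dot product of the first two texts: they are close in meaning but
# share no token, so the BOW model sees no overlap at all.
overlap = sum(a * b for a, b in zip(vectors[0], vectors[1]))

print(sparsity)
print(overlap)
```

Even in this tiny corpus two thirds of every vector is zero, and the two medical texts look unrelated. A topic representation aims to map such texts to nearby vectors.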

Topic Models in gensim: LDA, LSI, and Random Projections

This package supports three algorithms provided by gensim, namely, LDA, LSI, and Random Projections, to do the topic modeling.

>>> import shorttext

First, load a set of training data (all NIH data in this example):

>>> trainclassdict = shorttext.data.nihreports(sample_size=None)

Initialize a topic modeler, using LDA as an example:

>>> topicmodeler = shorttext.generators.LDAModeler()

For other algorithms, use LSIModeler for LSI or RPModeler for RP; everything else is the same. To train with 128 topics, enter:

>>> topicmodeler.train(trainclassdict, 128)

After the training is done, the user can retrieve the topic vector representation with the trained model. For example,

>>> topicmodeler.retrieve_topicvec('stem cell research')

>>> topicmodeler.retrieve_topicvec('informatics')

By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:

>>> topicmodeler['stem cell research']

>>> topicmodeler['informatics']

If none of the processed tokens is in the dictionary, a numpy array whose values are all nan is returned.
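
Callers may want to detect the all-nan case before passing a vector on. The sketch below shows one way to do so in plain Python. The helpers `is_all_nan` and `topicvec_or_none` are hypothetical names, and a dict with made-up vectors stands in for a trained topic modeler so that the example runs without shorttext.

```python
import math

def is_all_nan(vector):
    """Return True if the vector is non-empty and every entry is nan."""
    values = [float(x) for x in vector]
    return len(values) > 0 and all(math.isnan(x) for x in values)

def topicvec_or_none(modeler, text):
    """Look up a topic vector, returning None for an all-nan result."""
    vector = modeler[text]
    return None if is_all_nan(vector) else vector

# Stand-in for a trained topic modeler: a dict with made-up vectors.
fake_modeler = {
    "stem cell research": [0.6, 0.8],
    "zzzz": [float("nan"), float("nan")],
}

known = topicvec_or_none(fake_modeler, "stem cell research")
unknown = topicvec_or_none(fake_modeler, "zzzz")
print(known, unknown)
```

Because the check only iterates over the entries, it should work on a numpy array as well as on a list.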

In the training and the retrieval above, the same preprocessing is applied. Users can provide their own preprocessor when initializing the topic modeler.

Users can save the trained model by calling:

>>> topicmodeler.save_compact_model('/path/to/nihlda128.bin')

And the topic model can be retrieved by calling:

>>> import shorttext.generators
>>> topicmodeler2 = shorttext.generators.GensimTopicModeler.from_pretrained('/path/to/nihlda128.bin')

When initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency-inverse document frequency). The default is to weigh. To disable the weighting, initialize it as:

>>> topicmodeler3 = shorttext.generators.GensimTopicModeler(toweigh=False)
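
For readers unfamiliar with tf-idf, here is a minimal sketch of the idea in plain Python. It uses a textbook variant (raw term count times the natural logarithm of N/df). The weighting that gensim applies may use a different logarithm base and normalization, so the numbers are illustrative only. The function `tfidf` and the toy documents are assumptions for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weigh each token by its count times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {token: count * math.log(n_docs / df[token]) for token, count in tf.items()}
        )
    return weighted

# Tokenized toy documents (made up for this sketch).
docs = [
    ["stem", "cell", "research"],
    ["cell", "biology"],
    ["cancer", "research", "cell"],
]
weights = tfidf(docs)
print(weights[0])
```

In this toy corpus "cell" occurs in every document and receives weight zero, while the rarer "stem" receives the largest weight in the first document.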

class shorttext.generators.bow.GensimTopicModeling.GensimTopicModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True)[source]

Bases: LatentTopicModeler

Topic modeler using gensim implementations.

Supports LDA (Latent Dirichlet Allocation), LSI (Latent Semantic Indexing), and Random Projections (RP) for topic modeling.

Note:: For compact model I/O, use LDAModeler, LSIModeler, or RPModeler instead.

__init__(preprocessor: callable | None = None, tokenizer: callable | None = None, algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True)[source]

Initialize the topic modeler.

Args:: preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. algorithm: Topic modeling algorithm. Options: 'lda', 'lsi', 'rp'. Default: 'lda'. toweigh: Whether to apply tf-idf weighting. Default: True. normalize: Whether to normalize topic vectors. Default: True.

generate_corpus(classdict: dict[str, list[str]]) → None[source]

Generate gensim dictionary and corpus.

Args:: classdict: Training data.

train(classdict: dict[str, list[str]], nb_topics: int, *args, **kwargs) → None[source]

Train the topic modeler.

Args:: classdict: Training data with class labels as keys and texts as values. nb_topics: Number of latent topics. *args: Arguments for the gensim topic model. **kwargs: Keyword arguments for the gensim topic model.

update(additional_classdict: dict[str, list[str]]) → None[source]

Update model with additional data.

Warning: Does not support adding new class labels or new vocabulary. For comprehensive updates, retrain the model.

Args:: additional_classdict: Additional training data.

retrieve_bow(shorttext: str) → list[tuple[int, int]][source]

Get bag-of-words representation.

Args:: shorttext: Input text.
Returns:: List of (word_id, count) tuples.

retrieve_bow_vector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get bag-of-words vector.

Args:: shorttext: Input text.
Returns:: BOW vector.

retrieve_corpus_topicdist(shorttext: str) → list[tuple[int, int | float]][source]

Get topic distribution (corpus form).

Args:: shorttext: Input text.
Returns:: List of (topic_id, weight) tuples.
Raises:: ModelNotTrainedException: If model not trained.

retrieve_topicvec(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Topic vector.
Raises:: ModelNotTrainedException: If model not trained.

get_batch_cos_similarities(shorttext: str) → dict[str, float][source]

Get cosine similarities to all classes.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
Raises:: ModelNotTrainedException: If model not trained.

loadmodel(nameprefix: str) → None[source]

Load topic model from files.

Args:: nameprefix: Prefix for input files.

savemodel(nameprefix: str) → None[source]

Save topic model to files.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

classmethod from_pretrained(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load a gensim topic model from files.

Args:: name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. compact: Whether to load compact model. Default: True.
Returns:: A topic modeler instance.

class shorttext.generators.bow.GensimTopicModeling.LDAModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, toweigh: bool = True, normalize: bool = True)[source]

Bases: GensimTopicModeler, CompactIOMachine

LDA topic modeler with compact I/O support.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

class shorttext.generators.bow.GensimTopicModeling.LSIModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, toweigh: bool = True, normalize: bool = True)[source]

Bases: GensimTopicModeler, CompactIOMachine

LSI topic modeler with compact I/O support.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

class shorttext.generators.bow.GensimTopicModeling.RPModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, toweigh: bool = True, normalize: bool = True)[source]

Bases: GensimTopicModeler, CompactIOMachine

Random Projection topic modeler with compact I/O support.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

shorttext.generators.bow.GensimTopicModeling.load_gensimtopicmodel(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → GensimTopicModeler[source]: Deprecated. Use GensimTopicModeler.from_pretrained instead.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

AutoEncoder

Another way to find a new topic vector representation is to use an autoencoder, a neural network model that compresses a vector representation into another, usually shorter (rarely longer), representation by minimizing the difference between the input layer and the decoding layer. For faster demonstration, use the subject keywords as the example dataset:

>>> subdict = shorttext.data.subjectkeywords()

Training such a model is similar to training the LDA model (or the LSI and random projections models above):

>>> autoencoder = shorttext.generators.AutoencodingTopicModeler()
>>> autoencoder.train(subdict, 8)

After the training is done, the user can retrieve the encoded vector representation with the trained autoencoder model. For example,

>>> autoencoder.retrieve_topicvec('linear algebra')

>>> autoencoder.retrieve_topicvec('path integral')

By default, the vectors are normalized. Another way to retrieve the topic vector representation is as follows:

>>> autoencoder['linear algebra']

>>> autoencoder['path integral']

In the training and the retrieval above, the same preprocessing is applied. Users can provide their own preprocessor when initializing the topic modeler.

Users can save the trained model by calling:

>>> autoencoder.save_compact_model('/path/to/sub_autoencoder8.bin')

And the model can be retrieved by calling:

>>> autoencoder2 = shorttext.generators.load_autoencoder_topicmodel('/path/to/sub_autoencoder8.bin')

As with the other topic models, when initializing the topic modeler, the user can also specify whether to weigh the terms using tf-idf (term frequency-inverse document frequency). The default is to weigh. To disable the weighting, initialize it as:

>>> autoencoder3 = shorttext.generators.AutoencodingTopicModeler(toweigh=False)
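
The compression idea behind an autoencoder can be sketched without any deep-learning library. The following is a deliberately simplified linear autoencoder in plain Python, trained by stochastic gradient descent on the squared reconstruction error. It is not the Keras model that this package builds. The function names, the toy data, and the hyperparameters are all assumptions chosen for illustration.

```python
import random

def train_linear_autoencoder(data, latent_dim, epochs=500, lr=0.02, seed=0):
    """Train a linear encoder/decoder pair with stochastic gradient descent."""
    rng = random.Random(seed)
    n = len(data[0])
    # Encoder weights: latent_dim x n. Decoder weights: n x latent_dim.
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(latent_dim)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(latent_dim)] for _ in range(n)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            # Encode to the short code z, then decode back to y.
            z = [sum(enc[k][j] * x[j] for j in range(n)) for k in range(latent_dim)]
            y = [sum(dec[i][k] * z[k] for k in range(latent_dim)) for i in range(n)]
            err = [y[i] - x[i] for i in range(n)]
            total += sum(e * e for e in err)
            # Gradient of the squared error with respect to the code z,
            # computed before the decoder weights are changed.
            grad_z = [
                sum(2 * err[i] * dec[i][k] for i in range(n))
                for k in range(latent_dim)
            ]
            for i in range(n):
                for k in range(latent_dim):
                    dec[i][k] -= lr * 2 * err[i] * z[k]
            for k in range(latent_dim):
                for j in range(n):
                    enc[k][j] -= lr * grad_z[k] * x[j]
        # Mean reconstruction error over this epoch.
        losses.append(total / len(data))
    return enc, dec, losses

def encode(enc, x):
    """Map an input vector to its shorter encoded representation."""
    return [sum(w * v for w, v in zip(row, x)) for row in enc]

# Four-dimensional toy inputs that lie in a two-dimensional subspace.
data = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 0.5, 0.0],
]
enc, dec, losses = train_linear_autoencoder(data, latent_dim=2)
code = encode(enc, data[0])
print(len(code), losses[0], losses[-1])
```

The encoded vector plays the role of the topic vector: it has fewer dimensions than the input, and training drives down the difference between each input and its reconstruction.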

shorttext.generators.bow.AutoEncodingTopicModeling.get_autoencoder_models(vector_size: int, nb_latent_vector_size: int) → AutoEncoderPackage[source]

Create autoencoder model components.

Args:: vector_size: Size of input vectors. nb_latent_vector_size: Size of the latent space (number of topics).
Returns:: AutoEncoderPackage containing autoencoder, encoder, and decoder models.

class shorttext.generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True)[source]

Bases: LatentTopicModeler, CompactIOMachine

Topic modeler using autoencoder.

Uses a Keras autoencoder to learn latent topic representations. The encoded vectors serve as topic vectors for short text classification.

Reference:: Francois Chollet, “Building Autoencoders in Keras,” https://blog.keras.io/building-autoencoders-in-keras.html

train(classdict: dict[str, list[str]], nb_topics: int, *args, **kwargs) → None[source]

Train the autoencoder topic model.

Args:: classdict: Training data with class labels as keys and texts as values. nb_topics: Number of latent topics (encoding dimensions). *args: Arguments for Keras model fitting. **kwargs: Keyword arguments for Keras model fitting.

retrieve_bow(shorttext: str) → list[tuple[int, int]][source]

Get bag-of-words representation.

Args:: shorttext: Input text.
Returns:: List of (token_index, count) tuples.

retrieve_bow_vector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get bag-of-words vector.

Args:: shorttext: Input text.
Returns:: BOW vector (normalized if normalize=True).

retrieve_topicvec(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Encoded vector representation.
Raises:: ModelNotTrainedException: If model not trained.

precalculate_liststr_topicvec(shorttexts: list[str]) → ndarray[tuple[Any, ...], dtype[float64]][source]

Calculate average topic vector for a list of texts.

Used during training to compute class centroids.

Args:: shorttexts: List of texts.
Returns:: Average topic vector (normalized).
Raises:: ModelNotTrainedException: If model not trained.

get_batch_cos_similarities(shorttext: str) → dict[str, float][source]

Get cosine similarities to all class centroids.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
Raises:: ModelNotTrainedException: If model not trained.

savemodel(nameprefix: str, save_complete_autoencoder: bool = True) → None[source]

Save the autoencoder model to files.

Saves encoder, optional decoder, and autoencoder weights along with configuration parameters.

Args:: nameprefix: Prefix for output files. save_complete_autoencoder: Whether to save decoder and complete autoencoder. Default: True.
Raises:: ModelNotTrainedException: If model not trained.

loadmodel(nameprefix: str, load_incomplete: bool = False) → None[source]

Load the autoencoder model from files.

Args:: nameprefix: Prefix for input files. load_incomplete: If True, only load encoder (for models from v0.2.1). Default: False.
Raises:: ModelNotTrainedException: If loading fails.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

classmethod from_pretrained(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load an autoencoder topic model from files.

Args:: name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. compact: Whether to load compact model. Default: True.
Returns:: An AutoencodingTopicModeler instance.

shorttext.generators.bow.AutoEncodingTopicModeling.load_autoencoder_topicmodel(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → AutoencodingTopicModeler[source]: Deprecated. Use AutoencodingTopicModeler.from_pretrained instead.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

Abstract Latent Topic Modeling Class

Both shorttext.generators.GensimTopicModeler and shorttext.generators.AutoencodingTopicModeler extend shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler, an abstract base class. Users who want to develop their own topic model by extending this class must implement its abstract methods, including train, retrieve_topicvec, loadmodel, and savemodel.
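
The subclassing pattern can be sketched without shorttext installed. In the example below, `TopicModelerSketch` is a simplified stand-in that declares only four abstract methods; it is not the real LatentTopicModeler, which declares more abstract methods (see its reference entry). The toy subclass `ClassVocabularyModeler` and its JSON file format are likewise hypothetical.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod

class TopicModelerSketch(ABC):
    """Stand-in base class for this sketch; NOT shorttext's LatentTopicModeler."""

    @abstractmethod
    def train(self, classdict, nb_topics): ...

    @abstractmethod
    def retrieve_topicvec(self, shorttext): ...

    @abstractmethod
    def savemodel(self, nameprefix): ...

    @abstractmethod
    def loadmodel(self, nameprefix): ...

    def __getitem__(self, shorttext):
        # Indexing is a shortcut for retrieve_topicvec.
        return self.retrieve_topicvec(shorttext)


class ClassVocabularyModeler(TopicModelerSketch):
    """Toy modeler: each 'topic' is the vocabulary of one class label."""

    def train(self, classdict, nb_topics):
        labels = sorted(classdict)[:nb_topics]
        self.topics = [
            sorted({tok for text in classdict[label] for tok in text.lower().split()})
            for label in labels
        ]

    def retrieve_topicvec(self, shorttext):
        tokens = shorttext.lower().split()
        # One entry per topic: number of tokens found in that topic's vocabulary.
        return [sum(tok in vocab for tok in tokens) for vocab in self.topics]

    def savemodel(self, nameprefix):
        with open(nameprefix + ".json", "w") as f:
            json.dump(self.topics, f)

    def loadmodel(self, nameprefix):
        with open(nameprefix + ".json") as f:
            self.topics = json.load(f)


classdict = {
    "mathematics": ["linear algebra", "topology"],
    "physics": ["path integral", "quantum mechanics"],
}
modeler = ClassVocabularyModeler()
modeler.train(classdict, 2)
vector = modeler["linear algebra"]

# Round-trip through a temporary file to exercise savemodel/loadmodel.
with tempfile.TemporaryDirectory() as tmpdir:
    prefix = os.path.join(tmpdir, "toy")
    modeler.savemodel(prefix)
    restored = ClassVocabularyModeler()
    restored.loadmodel(prefix)

print(vector, restored["path integral"])
```

Leaving any abstract method undefined makes the subclass impossible to instantiate, which is how an abstract base class enforces the interface.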

class shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True)[source]

Bases: ABC

Abstract base class for topic modelers.

Provides interface for converting short texts to topic vector representations using various topic modeling algorithms.

__init__(preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True)[source]

Initialize the topic modeler.

Args:: preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. tokenizer: Tokenization function. Default: tokenize. normalize: Whether to normalize output vectors. Default: True.

abstractmethod train(classdict: dict[str, list[str]], nb_topics: int, *args, **kwargs) → None[source]

Train the topic modeler.

Args:: classdict: Training data with class labels as keys and texts as values. nb_topics: Number of latent topics. *args: Additional arguments for the training algorithm. **kwargs: Additional keyword arguments.
Raises:: NotImplementedError: This is an abstract method.

abstractmethod retrieve_bow(shorttext: str) → list[tuple[int, int]][source]

Get bag-of-words representation.

Args:: shorttext: Input text.
Returns:: List of (word_id, count) tuples.
Raises:: NotImplementedError: Abstract method.

abstractmethod retrieve_bow_vector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get bag-of-words vector.

Args:: shorttext: Input text.
Returns:: BOW vector.
Raises:: NotImplementedError: Abstract method.

abstractmethod retrieve_topicvec(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Topic vector.
Raises:: NotImplementedError: Abstract method.

abstractmethod get_batch_cos_similarities(shorttext: str) → dict[str, float][source]

Get cosine similarities to all classes.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
Raises:: NotImplementedError: Abstract method.

__getitem__(shorttext) → ndarray[tuple[Any, ...], dtype[float64]][source]: Get topic vector for text (shortcut for retrieve_topicvec).

__contains__(shorttext)[source]: Check if model is trained.

abstractmethod loadmodel(nameprefix: str)[source]

Load model from files.

Args:: nameprefix: Prefix for input files.
Raises:: NotImplementedError: Abstract method.

abstractmethod savemodel(nameprefix: str)[source]

Save model to files.

Args:: nameprefix: Prefix for output files.
Raises:: NotImplementedError: Abstract method.

abstractmethod get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

Classification Using Cosine Similarity

The topic modelers are trained to represent a short text as a topic vector, which is effectively the feature vector. However, to perform supervised classification, a classification algorithm is still needed. The first option is to calculate the cosine similarities between the topic vector of the given short text and the topic vectors representing each class label.

If there is already a trained topic modeler, whether it is shorttext.generators.GensimTopicModeler or shorttext.generators.AutoencodingTopicModeler, a classifier based on cosine similarities can be initialized immediately without training. Taking the LDA example above, such a classifier can be initialized as follows:

>>> cos_classifier = shorttext.classifiers.TopicVectorCosineDistanceClassifier(topicmodeler)

Or, if the topic modeler has already been saved, the same classifier can be initialized by loading the topic modeler:

>>> import shorttext.classifiers
>>> cos_classifier = shorttext.classifiers.load_gensimtopicvec_cosineClassifier('/path/to/nihlda128.bin')

To perform prediction, enter:

>>> cos_classifier.score('stem cell research')

which outputs a dictionary with labels and the corresponding scores.

The same applies to the autoencoder, but a classifier based on an autoencoder is loaded by another function:

>>> cos_classifier = shorttext.classifiers.load_autoencoder_cosineClassifier('/path/to/sub_autoencoder8.bin')
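
The cosine-similarity scoring described in this section can be sketched in plain Python. The example assumes that each class is represented by the average (centroid) of the topic vectors of its texts, as the reference entries describe. The three-dimensional vectors and the helper names `cosine`, `centroid`, and `score` are made up for illustration and are not the package's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; nan if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return float("nan")
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def score(vector, centroids):
    """Map every class label to its cosine similarity with the input."""
    return {label: cosine(vector, c) for label, c in centroids.items()}

# Made-up topic vectors of the training texts, grouped by class label.
class_vectors = {
    "mathematics": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "physics": [[0.1, 0.8, 0.1], [0.0, 0.9, 0.1]],
}
centroids = {label: centroid(vs) for label, vs in class_vectors.items()}

# Made-up topic vector of the text to classify.
scores = score([0.7, 0.3, 0.0], centroids)
best = max(scores, key=scores.get)
print(scores, best)
```

No training step is needed beyond computing the centroids, which is why such a classifier can be built directly from a trained topic modeler.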

class shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.TopicVecCosineDistanceClassifier(topicmodeler: LatentTopicModeler)[source]

Bases: AbstractScorer

Classifier using cosine similarity with topic vectors.

Classifies short text based on cosine similarity between topic vectors of the input and class centroids. Topic vectors are generated by a LatentTopicModeler.

__init__(topicmodeler: LatentTopicModeler)[source]

Initialize the classifier.

Args:: topicmodeler: A LatentTopicModeler instance.

score(shorttext: str) → dict[str, float][source]

Calculate cosine similarity to all class topic vectors.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.

loadmodel(nameprefix: str) → None[source]

Load the topic model.

Args:: nameprefix: Prefix for input files.

savemodel(nameprefix: str) → None[source]

Save the topic model.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

load_compact_model(name: str) → None[source]

Load compact model.

Args:: name: Name of the compact model file.

save_compact_model(name: str) → None[source]

Save compact model.

Args:: name: Name of the compact model file.

classmethod from_pretrained_gensimtopic(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load a gensim topic model and return a cosine classifier.

Args:: name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. compact: Whether to load compact model. Default: True.
Returns:: TopicVecCosineDistanceClassifier instance.

classmethod from_pretrained_autoencoder(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_gensimtopicvec_cosineClassifier(classdict: dict[str, list[str]], nb_topics: int, preprocessor: callable | None = None, tokenizer: callable | None = None, algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True, *args, **kwargs) → TopicVecCosineDistanceClassifier[source]

Train a gensim topic model and return a cosine classifier.

Args:: classdict: Training data. nb_topics: Number of latent topics. preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. algorithm: Topic modeling algorithm. Options: lda, lsi, rp. Default: lda. toweigh: Whether to apply tf-idf weighting. Default: True. normalize: Whether to normalize topic vectors. Default: True. *args: Additional arguments for gensim topic model. **kwargs: Additional keyword arguments for gensim topic model.
Returns:: TopicVecCosineDistanceClassifier instance.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_gensimtopicvec_cosineClassifier(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → TopicVecCosineDistanceClassifier[source]: Deprecated. Use TopicVecCosineDistanceClassifier.from_pretrained_gensimtopic instead.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_autoencoder_cosineClassifier(classdict: dict[str, list[str]], nb_topics: int, preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True, *args, **kwargs) → TopicVecCosineDistanceClassifier[source]

Train an autoencoder topic model and return a cosine classifier.

Args:: classdict: Training data. nb_topics: Number of topics (encoding dimensions). preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. normalize: Whether to normalize topic vectors. Default: True. *args: Additional arguments for Keras model fitting. **kwargs: Additional keyword arguments for Keras model fitting.
Returns:: TopicVecCosineDistanceClassifier instance.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_autoencoder_cosineClassifier(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → TopicVecCosineDistanceClassifier[source]: Deprecated. Use TopicVecCosineDistanceClassifier.from_pretrained_autoencoder instead.

Classification Using Scikit-Learn Classifiers

The topic modeler can be used to generate features for other machine learning algorithms. Any supervised learning algorithm in scikit-learn can be used here; Gaussian naive Bayes serves as the example. For faster demonstration, use the subject keywords as the example dataset.

>>> subtopicmodeler = shorttext.generators.LDAModeler()
>>> subtopicmodeler.train(subdict, 8)

We first import the class:

>>> from sklearn.naive_bayes import GaussianNB

And we train the classifier:

>>> classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(subtopicmodeler, GaussianNB())
>>> classifier.train(subdict)

Predictions can be performed like the following example:

>>> classifier.score('functional integral')

which outputs a dictionary with labels and the corresponding scores.
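
Conceptually, using topic vectors as features means turning the class dictionary into a feature matrix and a list of labels. The sketch below shows that step in plain Python. The function `build_training_data` and the stand-in `toy_vector` (which replaces a trained topic modeler's vector lookup) are assumptions for illustration, not the package's internals.

```python
def build_training_data(classdict, to_vector):
    """Turn {label: [texts]} into a feature matrix X and a label list y."""
    X, y = [], []
    for label, texts in classdict.items():
        for text in texts:
            X.append(to_vector(text))
            y.append(label)
    return X, y

# Hypothetical stand-in for a trained topic modeler's vector lookup.
def toy_vector(text):
    tokens = text.lower().split()
    return [
        float(sum(tok in ("algebra", "linear", "topology") for tok in tokens)),
        float(sum(tok in ("integral", "path", "quantum") for tok in tokens)),
    ]

classdict = {
    "mathematics": ["linear algebra", "topology"],
    "physics": ["path integral", "quantum mechanics"],
}
X, y = build_training_data(classdict, toy_vector)
print(X)
print(y)
```

With scikit-learn installed, X and y could then be passed to an estimator, for example `GaussianNB().fit(X, y)`.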

You can save the model by:

>>> classifier.save_compact_model('/path/to/sublda8nb.bin')

where the argument specifies the prefix of the path of the model files, including the topic models, and the scikit-learn model files. The classifier can be loaded by calling:

>>> classifier2 = shorttext.classifiers.load_gensim_topicvec_sklearnclassifier('/path/to/sublda8nb.bin')

The topic modeler here can also be an autoencoder: passing an autoencoder as subtopicmodeler works the same way. However, to load a saved classifier with an autoencoder model, use:

>>> classifier2 = shorttext.classifiers.load_autoencoder_topic_sklearnclassifier('/path/to/filename.bin')

Compact model files saved by TopicVectorSkLearnClassifier in shorttext >= 1.0.0 cannot be read by earlier versions of shorttext. The reverse is not true: old compact model files can still be read.

class shorttext.classifiers.bow.topic.SkLearnClassification.TopicVectorSkLearnClassifier(topicmodeler: LatentTopicModeler, sklearn_classifier: BaseEstimator)[source]

Bases: AbstractScorer

Classifier using topic vectors with scikit-learn.

Wraps any scikit-learn supervised learning algorithm and uses topic vectors from LatentTopicModeler as features.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

__init__(topicmodeler: LatentTopicModeler, sklearn_classifier: BaseEstimator)[source]

Initialize the classifier.

Args:: topicmodeler: A topic modeler instance. sklearn_classifier: A scikit-learn classifier instance.

train(classdict: dict[str, list[str]], *args, **kwargs) → None[source]

Train the classifier.

Args:: classdict: Training data with class labels as keys and texts as values. *args: Arguments passed to scikit-learn classifier fit(). **kwargs: Arguments passed to scikit-learn classifier fit().
Raises:: ModelNotTrainedException: If topic modeler is not trained.

getvector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Topic vector representation.
Raises:: ModelNotTrainedException: If model not trained.

classify(shorttext: str) → str[source]

Classify short text into a class label.

Args:: shorttext: Input text to classify.
Returns:: Predicted class label.
Raises:: ModelNotTrainedException: If model not trained.

score(shorttext: str) → dict[str, float][source]

Compute classification scores for all classes.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If model not trained.

savemodel(nameprefix: str) → None[source]

Save model to files.

Saves the topic model, scikit-learn classifier, and class labels.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

loadmodel(nameprefix: str) → None[source]

Load model from files.

Args:: nameprefix: Prefix for input files.

save_compact_model(name: str) → None[source]

Save model as compact file.

Args:: name: Name of the compact model file.
Raises:: ModelNotTrainedException: If model not trained.

load_compact_model(name: str) → None[source]

Load model from compact file.

Args:

name: Name of the compact model file.

classmethod from_pretrained_gensimtopic_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → Self[source]

Load a classifier with gensim topic vectors from files.

Args:

name: Model name (compact) or file prefix (non-compact).
preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
compact: Load compact model. Default: True.

Returns:

TopicVectorSkLearnClassifier instance.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

classmethod from_pretrained_autoencoder_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → Self[source]

Load a classifier with autoencoder topic vectors from files.

Args:

name: Model name (compact) or file prefix (non-compact).
preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
compact: Load compact model. Default: True.

Returns:

TopicVectorSkLearnClassifier instance.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

shorttext.classifiers.bow.topic.SkLearnClassification.train_gensim_topicvec_sklearnclassifier(classdict: dict[str, list[str]], nb_topics: int, sklearn_classifier: BaseEstimator, preprocessor: callable | None = None, topicmodel_algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True, gensim_paramdict: dict | None = None, sklearn_paramdict: dict | None = None) → TopicVectorSkLearnClassifier[source]

Train a classifier with gensim topic vectors and scikit-learn.

Trains a topic model (LDA, LSI, or RP), then uses the topic vectors as features to train a scikit-learn classifier.

Args:

classdict: Training data.
nb_topics: Number of topics.
sklearn_classifier: Scikit-learn classifier instance (not trained).
preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
topicmodel_algorithm: Topic model algorithm. Default: lda.
toweigh: Apply tf-idf weighting. Default: True.
normalize: Normalize topic vectors. Default: True.
gensim_paramdict: Arguments for gensim topic model.
sklearn_paramdict: Arguments for scikit-learn classifier.

Returns:

Trained TopicVectorSkLearnClassifier.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510
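
The normalize option can be pictured as rescaling each topic vector to unit length. The sketch below assumes L2 (Euclidean) normalization, which this page does not confirm, and uses plain Python lists as stand-ins for the numpy arrays the package returns. The helpers l2_normalize and is_all_nan are hypothetical. The NaN check reflects the behavior noted earlier for the topic modelers when none of the processed tokens are in the dictionary.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length; return a copy unchanged if its norm is zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def is_all_nan(vector: list[float]) -> bool:
    """True when every component is NaN, e.g. for text with no known tokens."""
    return len(vector) > 0 and all(math.isnan(x) for x in vector)


raw_topicvec = [3.0, 0.0, 4.0]            # invented topic weights
unit_topicvec = l2_normalize(raw_topicvec)
print(unit_topicvec)

unknown_topicvec = [float("nan")] * 3     # stand-in for an out-of-vocabulary text
print(is_all_nan(unknown_topicvec))
```

Checking for an all-NaN vector before passing it to a downstream classifier avoids feeding it features that carry no information.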

shorttext.classifiers.bow.topic.SkLearnClassification.load_gensim_topicvec_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → TopicVectorSkLearnClassifier[source]

Deprecated. Use TopicVectorSkLearnClassifier.from_pretrained_gensimtopic_sklearnclassifier instead.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

shorttext.classifiers.bow.topic.SkLearnClassification.train_autoencoder_topic_sklearnclassifier(classdict: dict[str, list[str]], nb_topics: int, sklearn_classifier: BaseEstimator, preprocessor: callable | None = None, normalize: bool = True, keras_paramdict: dict | None = None, sklearn_paramdict: dict | None = None) → TopicVectorSkLearnClassifier[source]

Train a classifier with autoencoder topic vectors and scikit-learn.

Trains an autoencoder topic model, then uses the encoded vectors as features to train a scikit-learn classifier.

Args:

classdict: Training data.
nb_topics: Number of encoding dimensions.
sklearn_classifier: Scikit-learn classifier instance (not trained).
preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
normalize: Normalize topic vectors. Default: True.
keras_paramdict: Arguments for Keras autoencoder training.
sklearn_paramdict: Arguments for scikit-learn classifier.

Returns:

Trained TopicVectorSkLearnClassifier.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

shorttext.classifiers.bow.topic.SkLearnClassification.load_autoencoder_topic_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → TopicVectorSkLearnClassifier[source]

Deprecated. Use TopicVectorSkLearnClassifier.from_pretrained_autoencoder_sklearnclassifier instead.

Deprecated since version 4.0.1: This will be removed in 5.0.0.
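
Since the deprecated loader functions are scheduled for removal in 5.0.0, it can help to detect any remaining calls to them. The sketch below relies only on Python's standard warnings module. old_loader is a hypothetical stand-in, and whether shorttext emits DeprecationWarning specifically (rather than another warning category) is an assumption.

```python
import warnings


def old_loader(name: str) -> str:
    """Hypothetical stand-in for a deprecated loader function."""
    warnings.warn(
        "old_loader is deprecated; use the from_pretrained_* classmethods instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return f"model loaded from {name}"


# Record warnings instead of printing them, and make sure none are filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_loader("my_model.bin")

deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
print(result)
print(len(deprecations))
```

In a test suite, warnings.simplefilter("error", DeprecationWarning) turns such calls into exceptions, which makes leftover uses easy to find before upgrading.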

Notes about Text Preprocessing

The topic models are built on the bag-of-words model, so text preprocessing matters a great deal. However, the preprocessing step cannot be serialized with the model, so users must keep track of it themselves and supply the same preprocessor when loading a saved model. Unless a custom step is necessary, use the standard preprocessing.
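
One way to keep track of a custom preprocessing step is to describe it with a small, serializable configuration and store that next to the model files. The sketch below is an assumption about workflow, not a shorttext feature: build_preprocessor, the configuration keys, and the sidecar file name are all hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Callable


def build_preprocessor(config: dict) -> Callable[[str], str]:
    """Rebuild a text preprocessor from a JSON-serializable configuration."""
    stopwords = set(config.get("stopwords", []))
    lowercase = config.get("lowercase", True)

    def preprocess(text: str) -> str:
        if lowercase:
            text = text.lower()
        text = re.sub(r"[^A-Za-z\s]", " ", text)  # replace digits and punctuation with spaces
        tokens = [tok for tok in text.split() if tok not in stopwords]
        return " ".join(tokens)

    return preprocess


config = {"lowercase": True, "stopwords": ["the", "of", "and"]}

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical sidecar file saved next to the model files.
    sidecar = Path(tmpdir) / "mymodel_preprocessor.json"
    sidecar.write_text(json.dumps(config))

    # Later, possibly in another process: restore the same preprocessing.
    restored = build_preprocessor(json.loads(sidecar.read_text()))

print(restored("The Biology of Stem-Cell Research"))
```

The restored function could then be passed as the preprocessor argument of the from_pretrained_* loaders, so that loading applies the same preprocessing as training.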

See more: Text Preprocessing.

Reference

David M. Blei, “Probabilistic Topic Models,” Communications of the ACM 55(4): 77-84 (2012).

Francois Chollet, “Building Autoencoders in Keras,” The Keras Blog. [Keras]

Xuan Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Minh Le Nguyen, Susumu Horiguchi, Quang-Thuy Ha, “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan, Le-Minh Nguyen, Susumu Horiguchi, “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW ’08 Proceedings of the 17th international conference on World Wide Web. (2008) [ACL]
