API

Complete API reference for the shorttext library.

Top-Level Modules

shorttext.smartload.smartload_compact_model(filename: str | PathLike, wvmodel: gensim.models.keyedvectors.KeyedVectors | None, preprocessor: callable | None = None, vecsize: int | None = None)[source]

Load a classifier or model from a compact file.

Automatically detects the model type and loads the appropriate classifier. Set wvmodel to None if no word embedding model is needed.

Args:: filename: Path to the compact model file. wvmodel: Word embedding model. Can be None for non-embedding models. preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. vecsize: Vector size. Default: None (extracted from model).
Returns:: Appropriate classifier or model instance.
Raises:: AlgorithmNotExistException: If model type is unknown.
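
As a rough usage sketch, the loader could be wrapped as follows. It assumes the compact file holds a classifier that exposes score() (a topic modeler would not), that no word-embedding model is needed, and that the caller supplies a valid file path. top_label and classify_with_compact_model are hypothetical helpers, not part of shorttext.

```python
def top_label(scores: dict[str, float]) -> str:
    """Return the label with the highest score (hypothetical helper)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores, key=scores.get)


def classify_with_compact_model(filename: str, text: str) -> str:
    """Load a compact model and classify one short text.

    Assumes the file holds a classifier with a score() method and that
    no word-embedding model is required (wvmodel=None).
    """
    # Imported lazily so top_label stays usable without shorttext installed.
    from shorttext.smartload import smartload_compact_model

    classifier = smartload_compact_model(filename, None)
    return top_label(classifier.score(text))
```

For embedding-based classifiers, the second argument would be a loaded KeyedVectors instance instead of None.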

Classifiers

Base Classifier

class shorttext.classifiers.base.AbstractScorer[source]

Bases: ABC

Abstract base class for scoring classifiers.

abstractmethod score(shorttext: str) → dict[str, float][source]

Calculate classification scores.

Args:: shorttext: Input text to classify.
Returns:: Dictionary mapping class labels to scores.
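
The contract above can be illustrated with a small stand-in. This is a sketch of the pattern only: Scorer and KeywordScorer are hypothetical names, and the keyword-counting logic is invented for the example rather than taken from the library.

```python
from abc import ABC, abstractmethod


class Scorer(ABC):
    """Stand-in for an abstract scoring interface (not the library class)."""

    @abstractmethod
    def score(self, shorttext: str) -> dict[str, float]:
        """Return a mapping from class label to score."""


class KeywordScorer(Scorer):
    """Toy scorer: counts how many of a label's keywords occur in the text."""

    def __init__(self, keywords: dict[str, set[str]]) -> None:
        self.keywords = keywords

    def score(self, shorttext: str) -> dict[str, float]:
        tokens = set(shorttext.lower().split())
        return {
            label: float(len(tokens & words))
            for label, words in self.keywords.items()
        }


scorer = KeywordScorer({"math": {"algebra", "matrix"}, "food": {"pasta", "rice"}})
result = scorer.score("linear algebra and matrix theory")
```

Instantiating Scorer directly raises TypeError, which is how the abstract base forces subclasses to implement score().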

Bag-of-Words Classifiers

class shorttext.classifiers.bow.topic.SkLearnClassification.TopicVectorSkLearnClassifier(topicmodeler: LatentTopicModeler, sklearn_classifier: BaseEstimator)[source]

Bases: AbstractScorer

Classifier using topic vectors with scikit-learn.

Wraps any scikit-learn supervised learning algorithm and uses topic vectors from LatentTopicModeler as features.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

__init__(topicmodeler: LatentTopicModeler, sklearn_classifier: BaseEstimator)[source]

Initialize the classifier.

Args:: topicmodeler: A topic modeler instance. sklearn_classifier: A scikit-learn classifier instance.

train(classdict: dict[str, list[str]], *args, **kwargs) → None[source]

Train the classifier.

Args:: classdict: Training data with class labels as keys and texts as values. *args: Arguments passed to scikit-learn classifier fit(). **kwargs: Arguments passed to scikit-learn classifier fit().
Raises:: ModelNotTrainedException: If topic modeler is not trained.
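
To make the classdict shape concrete, the sketch below builds a small training dictionary and flattens it into parallel text and label lists, the form a scikit-learn style fit(X, y) call typically expects. flatten_classdict is a hypothetical helper; how the library converts classdict internally is not shown here.

```python
def flatten_classdict(
    classdict: dict[str, list[str]],
) -> tuple[list[str], list[str]]:
    """Turn {label: [texts]} into parallel lists of texts and labels."""
    texts: list[str] = []
    labels: list[str] = []
    for label, examples in classdict.items():
        for text in examples:
            texts.append(text)
            labels.append(label)
    return texts, labels


# Class labels are the keys; each value is a list of short texts.
classdict = {
    "mathematics": ["linear algebra", "topology"],
    "physics": ["quantum mechanics"],
}
texts, labels = flatten_classdict(classdict)
```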

getvector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Topic vector representation.
Raises:: ModelNotTrainedException: If model not trained.

classify(shorttext: str) → str[source]

Classify short text into a class label.

Args:: shorttext: Input text to classify.
Returns:: Predicted class label.
Raises:: ModelNotTrainedException: If model not trained.

score(shorttext: str) → dict[str, float][source]

Compute classification scores for all classes.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If model not trained.

savemodel(nameprefix: str) → None[source]

Save model to files.

Saves the topic model, scikit-learn classifier, and class labels.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

loadmodel(nameprefix: str) → None[source]

Load model from files.

Args:: nameprefix: Prefix for input files.

save_compact_model(name: str) → None[source]

Save model as compact file.

Args:: name: Name of the compact model file.
Raises:: ModelNotTrainedException: If model not trained.

load_compact_model(name: str) → None[source]

Load model from compact file.

Args:: name: Name of the compact model file.

classmethod from_pretrained_gensimtopic_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → Self[source]

Load a classifier with gensim topic vectors from files.

Args:

name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. compact: Load compact model. Default: True.

Returns:

TopicVectorSkLearnClassifier instance.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

classmethod from_pretrained_autoencoder_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → Self[source]

Load a classifier with autoencoder topic vectors from files.

Args:

name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. compact: Load compact model. Default: True.

Returns:

TopicVectorSkLearnClassifier instance.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

shorttext.classifiers.bow.topic.SkLearnClassification.train_gensim_topicvec_sklearnclassifier(classdict: dict[str, list[str]], nb_topics: int, sklearn_classifier: BaseEstimator, preprocessor: callable | None = None, topicmodel_algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True, gensim_paramdict: dict | None = None, sklearn_paramdict: dict | None = None) → TopicVectorSkLearnClassifier[source]

Train a classifier with gensim topic vectors and scikit-learn.

Trains a topic model (LDA, LSI, or RP), then uses the topic vectors as features to train a scikit-learn classifier.

Args:

classdict: Training data.
nb_topics: Number of topics.
sklearn_classifier: Scikit-learn classifier instance (not trained).
preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
topicmodel_algorithm: Topic model algorithm. Default: lda.
toweigh: Apply tf-idf weighting. Default: True.
normalize: Normalize topic vectors. Default: True.
gensim_paramdict: Arguments for gensim topic model.
sklearn_paramdict: Arguments for scikit-learn classifier.

Returns:

Trained TopicVectorSkLearnClassifier.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

shorttext.classifiers.bow.topic.SkLearnClassification.load_gensim_topicvec_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → TopicVectorSkLearnClassifier[source]: Deprecated. Use TopicVectorSkLearnClassifier.from_pretrained_gensimtopic_sklearnclassifier.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

shorttext.classifiers.bow.topic.SkLearnClassification.train_autoencoder_topic_sklearnclassifier(classdict: dict[str, list[str]], nb_topics: int, sklearn_classifier: BaseEstimator, preprocessor: callable | None = None, normalize: bool = True, keras_paramdict: dict | None = None, sklearn_paramdict: dict | None = None) → TopicVectorSkLearnClassifier[source]

Train a classifier with autoencoder topic vectors and scikit-learn.

Trains an autoencoder topic model, then uses the encoded vectors as features to train a scikit-learn classifier.

Args:

classdict: Training data.
nb_topics: Number of encoding dimensions.
sklearn_classifier: Scikit-learn classifier instance (not trained).
preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
normalize: Normalize topic vectors. Default: True.
keras_paramdict: Arguments for Keras autoencoder training.
sklearn_paramdict: Arguments for scikit-learn classifier.

Returns:

Trained TopicVectorSkLearnClassifier.

Reference:

Xuan Hieu Phan et al., “A Hidden Topic-Based Framework toward Building Applications with Short Web Documents,” IEEE Trans. Knowl. Data Eng. 23(7): 961-976 (2011).

Xuan Hieu Phan et al., “Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections,” WWW 2008. http://dl.acm.org/citation.cfm?id=1367510

shorttext.classifiers.bow.topic.SkLearnClassification.load_autoencoder_topic_sklearnclassifier(name: str, preprocessor: callable | None = None, compact: bool = True) → TopicVectorSkLearnClassifier[source]: Deprecated. Use TopicVectorSkLearnClassifier.from_pretrained_autoencoder_sklearnclassifier.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

class shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.TopicVecCosineDistanceClassifier(topicmodeler: LatentTopicModeler)[source]

Bases: AbstractScorer

Classifier using cosine similarity with topic vectors.

Classifies short text based on cosine similarity between topic vectors of the input and class centroids. Topic vectors are generated by a LatentTopicModeler.

__init__(topicmodeler: LatentTopicModeler)[source]

Initialize the classifier.

Args:: topicmodeler: A LatentTopicModeler instance.

score(shorttext: str) → dict[str, float][source]

Calculate cosine similarity to all class topic vectors.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
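
The scoring rule described above can be sketched in plain Python. This is an illustration under assumptions, not the library's code: the topic vectors are hard-coded two-dimensional lists, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def score_against_classes(
    vector: list[float], class_vectors: dict[str, list[float]]
) -> dict[str, float]:
    """Map each class label to its cosine similarity with the input vector."""
    return {
        label: cosine_similarity(vector, class_vector)
        for label, class_vector in class_vectors.items()
    }


# Hypothetical per-class topic vectors and one input topic vector.
class_vectors = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
scores = score_against_classes([3.0, 4.0], class_vectors)
```

Here the input leans toward the second axis, so "politics" receives the higher similarity.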

loadmodel(nameprefix: str) → None[source]

Load the topic model.

Args:: nameprefix: Prefix for input files.

savemodel(nameprefix: str) → None[source]

Save the topic model.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

load_compact_model(name: str) → None[source]

Load compact model.

Args:: name: Name of the compact model file.

save_compact_model(name: str) → None[source]

Save compact model.

Args:: name: Name of the compact model file.

classmethod from_pretrained_gensimtopic(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load a gensim topic model and return a cosine classifier.

Args:: name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. tokenizer: Tokenizer function. Default: None. compact: Whether to load compact model. Default: True.
Returns:: TopicVecCosineDistanceClassifier instance.

classmethod from_pretrained_autoencoder(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load an autoencoder topic model and return a cosine classifier.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_gensimtopicvec_cosineClassifier(classdict: dict[str, list[str]], nb_topics: int, preprocessor: callable | None = None, tokenizer: callable | None = None, algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True, *args, **kwargs) → TopicVecCosineDistanceClassifier[source]

Train a gensim topic model and return a cosine classifier.

Args::
    classdict: Training data.
    nb_topics: Number of latent topics.
    preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
    tokenizer: Tokenizer function. Default: None.
    algorithm: Topic modeling algorithm. Options: lda, lsi, rp. Default: lda.
    toweigh: Whether to apply tf-idf weighting. Default: True.
    normalize: Whether to normalize topic vectors. Default: True.
    *args: Additional arguments for gensim topic model.
    **kwargs: Additional keyword arguments for gensim topic model.
Returns:: TopicVecCosineDistanceClassifier instance.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_gensimtopicvec_cosineClassifier(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → TopicVecCosineDistanceClassifier[source]: Deprecated. Use TopicVecCosineDistanceClassifier.from_pretrained_gensimtopic.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.train_autoencoder_cosineClassifier(classdict: dict[str, list[str]], nb_topics: int, preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True, *args, **kwargs) → TopicVecCosineDistanceClassifier[source]

Train an autoencoder topic model and return a cosine classifier.

Args::
    classdict: Training data.
    nb_topics: Number of topics (encoding dimensions).
    preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1.
    tokenizer: Tokenizer function. Default: None.
    normalize: Whether to normalize topic vectors. Default: True.
    *args: Additional arguments for Keras model fitting.
    **kwargs: Additional keyword arguments for Keras model fitting.
Returns:: TopicVecCosineDistanceClassifier instance.

shorttext.classifiers.bow.topic.TopicVectorDistanceClassification.load_autoencoder_cosineClassifier(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → TopicVecCosineDistanceClassifier[source]: Deprecated. Use TopicVecCosineDistanceClassifier.from_pretrained_autoencoder.

shorttext.classifiers.bow.maxent.MaxEntClassification.logistic_framework(nb_features: int, nb_outputs: int, l2reg: float = 0.01, bias_l2reg: float = 0.01, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam') → tensorflow.keras.Model[source]

Create a maximum entropy classifier neural network.

Args:: nb_features: Number of input features. nb_outputs: Number of output classes. l2reg: L2 regularization coefficient. Default: 0.01. bias_l2reg: L2 regularization for bias. Default: 0.01. optimizer: Optimizer. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. Default: adam.
Returns:: Keras Sequential model for maximum entropy classification.
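
For intuition, a maximum entropy classifier is equivalent to multinomial logistic regression: a linear map followed by a softmax. The sketch below shows only that forward pass in plain Python. It is not the Keras model this function builds, it omits the L2 regularization terms, and the weights and features are made-up values.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maxent_predict(
    features: list[float], weights: list[list[float]], biases: list[float]
) -> list[float]:
    """Class probabilities; weights[k][j] is the weight of feature j for class k."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Three input features (nb_features=3) and two output classes (nb_outputs=2).
probs = maxent_predict(
    features=[1.0, 0.0, 2.0],
    weights=[[0.5, 0.0, 0.25], [0.0, 1.0, 0.0]],
    biases=[0.0, 0.0],
)
```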

class shorttext.classifiers.bow.maxent.MaxEntClassification.MaxEntClassifier(preprocessor: callable | None = None)[source]

Bases: AbstractScorer, CompactIOMachine

Maximum entropy classifier.

A classifier that implements the principle of maximum entropy for text categorization using bag-of-words features.

Reference:: Adam L. Berger et al., “A Maximum Entropy Approach to Natural Language Processing,” Computational Linguistics 22(1): 39-72 (1996).

__init__(preprocessor: callable | None = None)[source]

Initialize the classifier.

Args:: preprocessor: Text preprocessing function. Default: lowercase.

shorttext_to_vec(shorttext: str) → SparseArray[source]

Convert short text to sparse vector.

Args:: shorttext: Input text.
Returns:: Sparse vector representation.

train(classdict: dict[str, list[str]], nb_epochs: int = 500, l2reg: float = 0.01, bias_l2reg: float = 0.01, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam') → None[source]

Train the classifier.

Args:: classdict: Training data. nb_epochs: Number of training epochs. Default: 500. l2reg: L2 regularization coefficient. Default: 0.01. bias_l2reg: L2 regularization for bias. Default: 0.01. optimizer: Optimizer. Default: adam.

savemodel(nameprefix: str) → None[source]

Save the trained model to files.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If not trained.

loadmodel(nameprefix: str) → None[source]

Load a trained model from files.

Args:: nameprefix: Prefix for input files.

score(shorttext: str) → dict[str, float][source]

Calculate classification scores for all class labels.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If not trained.

classmethod from_pretrained(name: str, compact: bool = True) → Self[source]

Load a MaxEntClassifier from file.

Args:: name: Model name (compact) or file prefix (non-compact). compact: Whether to load compact model. Default: True.
Returns:: MaxEntClassifier instance.

shorttext.classifiers.bow.maxent.MaxEntClassification.load_maxent_classifier(name: str, compact: bool = True) → MaxEntClassifier[source]: Deprecated. Use MaxEntClassifier.from_pretrained.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

Embedding-Based Classifiers

class shorttext.classifiers.embed.sumvec.SumEmbedVecClassification.SumEmbeddedVecClassifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, simfcn: callable | None = None)[source]

Bases: CompactIOMachine

Classifier using summed word embeddings.

Each class is represented as the sum of word embeddings for its training sentences, normalized to a unit vector. Prediction uses cosine similarity between the input vector and class centroids.

Reference:: Pre-trained Word2Vec: https://code.google.com/archive/p/word2vec/
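
The approach described above can be sketched with toy data. Everything here is illustrative: the two-dimensional EMBEDDINGS table stands in for a real word-embedding model, whitespace tokenization and skipping out-of-vocabulary words are assumptions, and the function names are hypothetical.

```python
import math

# Toy 2-dimensional embeddings; a real model would supply these vectors.
EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "bond": [0.1, 0.9],
}


def unit_sum(tokens: list[str], dim: int = 2) -> list[float]:
    """Sum the embeddings of known tokens and scale to unit length."""
    total = [0.0] * dim
    for token in tokens:
        vec = EMBEDDINGS.get(token)
        if vec is None:
            continue  # assumption: unknown words contribute nothing
        total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return total if norm == 0.0 else [t / norm for t in total]


def train(classdict: dict[str, list[str]]) -> dict[str, list[float]]:
    """One unit vector per class, built from all of its training texts."""
    return {
        label: unit_sum([tok for text in texts for tok in text.split()])
        for label, texts in classdict.items()
    }


def score(text: str, centroids: dict[str, list[float]]) -> dict[str, float]:
    """Dot product of unit vectors, which equals their cosine similarity."""
    vec = unit_sum(text.split())
    return {
        label: sum(a * b for a, b in zip(vec, centroid))
        for label, centroid in centroids.items()
    }


centroids = train({"pets": ["cat dog", "dog"], "finance": ["stock bond"]})
scores = score("cat", centroids)
```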

__init__(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, simfcn: callable | None = None)[source]

Initialize the classifier.

Args:: wvmodel: Word embedding model (e.g., Word2Vec). vecsize: Vector size. Default: None (extracted from model). simfcn: Similarity function. Default: cosine_similarity.

train(classdict: dict[str, list[str]]) → None[source]

Train the classifier.

Args:: classdict: Training data with class labels as keys and texts as values.
Raises:: ModelNotTrainedException: If not trained or loaded.

savemodel(nameprefix: str) → None[source]

Save the trained model.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If not trained.

loadmodel(nameprefix: str) → None[source]

Load a trained model.

Args:: nameprefix: Prefix for input files.

shorttext_to_embedvec(shorttext: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D Array'][source]

Convert short text to embedding vector.

Args:: shorttext: Input text.
Returns:: Normalized embedding vector.

score(shorttext: str) → dict[str, float][source]

Calculate classification scores for all class labels.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If not trained.

classmethod from_pretrained(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → Self[source]

Load a SumEmbeddedVecClassifier from file.

Args:: wvmodel: Word embedding model. name: Model name (compact) or prefix (non-compact). compact: Whether to load compact model. Default: True. vecsize: Vector size. Default: None.
Returns:: SumEmbeddedVecClassifier instance.

shorttext.classifiers.embed.sumvec.SumEmbedVecClassification.load_sumword2vec_classifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → SumEmbeddedVecClassifier[source]: Deprecated. Use SumEmbeddedVecClassifier.from_pretrained.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

class shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification.VarNNSumEmbeddedVecClassifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, maxlen: int = 15)[source]

Bases: AbstractScorer, CompactIOMachine

Neural network classifier using summed embeddings.

Wraps Keras neural network models for supervised short text classification. Each token is converted to an embedded vector using a pre-trained word-embedding model. The sentence embedding is the sum of token embeddings, normalized to a unit vector.

The neural network model must be a Keras Sequential model with output dimension matching the number of class labels.

Reference:: Pre-trained Word2Vec: https://code.google.com/archive/p/word2vec/

Example models are available in the frameworks module.

__init__(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, maxlen: int = 15)[source]

Initialize the classifier.

Args:: wvmodel: Word embedding model (e.g., Word2Vec). vecsize: Vector size. Default: None (extracted from model). maxlen: Maximum number of words per sentence. Default: 15.

convert_traindata_embedvecs(classdict: dict[str, list[str]]) → tuple[list[str], Annotated[ndarray[tuple[Any, ...], dtype[float64]], '2D Array'], Annotated[ndarray[tuple[Any, ...], dtype[int64]], '2D Array']][source]

Convert training data to embedded vectors.

Converts each short text into a normalized sum of word embeddings.

Args:: classdict: Training data with class labels as keys and texts as values.
Returns:: Tuple of (class_labels, embedding_matrix, labels_array).

train(classdict: dict[str, list[str]], kerasmodel: tensorflow.keras.models.Model, nb_epoch: int = 10) → None[source]

Train the classifier.

Args:: classdict: Training data. kerasmodel: Keras Sequential model. nb_epoch: Number of training epochs. Default: 10.
Raises:: ModelNotTrainedException: If not trained or loaded.

savemodel(nameprefix: str) → None[source]

Save the trained model to files.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If not trained.

loadmodel(nameprefix: str) → None[source]

Load a trained model from files.

Args:: nameprefix: Prefix for input files.

word_to_embedvec(word: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D Array'][source]

Convert a word to its embedding vector.

Args:: word: Input word.
Returns:: Embedding vector. Returns zeros if word not in vocabulary.

shorttext_to_embedvec(shorttext: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D Array'][source]

Convert short text to embedding vector.

Sums token embeddings and normalizes to unit vector.

Args:: shorttext: Input text.
Returns:: Normalized embedding vector.

score(shorttext: str) → dict[str, float][source]

Calculate classification scores for all class labels.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If not trained.

classmethod from_pretrained(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → Self[source]

Load a VarNNSumEmbeddedVecClassifier from file.

Args:: wvmodel: Word embedding model. name: Model name (compact) or file prefix (non-compact). compact: Whether to load compact model. Default: True. vecsize: Vector size. Default: None.
Returns:: VarNNSumEmbeddedVecClassifier instance.

shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification.load_varnnsumvec_classifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → VarNNSumEmbeddedVecClassifier[source]: Deprecated. Use VarNNSumEmbeddedVecClassifier.from_pretrained.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

shorttext.classifiers.embed.sumvec.frameworks.DenseWordEmbed(nb_labels: int, dense_nb_nodes: list[int] | None = None, dense_actfcn: Literal['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear'] | None = None, vecsize: int = 300, reg_coef: float = 0.1, final_activiation: Literal['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear'] = 'softmax', optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam') → tensorflow.keras.models.Model[source]

Create a dense neural network for embedding-based classification.

Args::
    nb_labels: Number of class labels.
    dense_nb_nodes: Nodes per layer. Default: [].
    dense_actfcn: Activation functions per layer. Default: [].
    vecsize: Embedding vector size. Default: 300.
    reg_coef: L2 regularization coefficient. Default: 0.1.
    final_activiation: Final layer activation. Default: softmax.
    optimizer: Optimizer. Default: adam.
Returns:: Keras Sequential model.
Raises:: UnequalArrayLengthsException: If dense_nb_nodes and dense_actfcn have different lengths.

class shorttext.classifiers.embed.nnlib.VarNNEmbedVecClassification.VarNNEmbeddedVecClassifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, maxlen: int = 15, with_gensim: bool = False)[source]

Bases: AbstractScorer, CompactIOMachine

Neural network classifier for short text categorization.

Wraps Keras neural network models for supervised short text classification. Each token is converted to an embedded vector using a pre-trained word-embedding model (e.g., Word2Vec). Sentences are represented as matrices (rank-2 or rank-3 arrays) and processed by the neural network.

The neural network model must be a Keras Sequential model with output dimension matching the number of class labels.

Reference:: Pre-trained Word2Vec: https://code.google.com/archive/p/word2vec/

Example models are available in the frameworks module.

__init__(wvmodel: gensim.models.keyedvectors.KeyedVectors, vecsize: int | None = None, maxlen: int = 15, with_gensim: bool = False)[source]

Initialize the classifier.

Args:: wvmodel: Word embedding model (e.g., Word2Vec). vecsize: Vector size. Default: None (extracted from model). maxlen: Maximum number of words per sentence. Default: 15. with_gensim: Whether to use gensim format. Default: False.

convert_trainingdata_matrix(classdict: dict[str, list[str]]) → tuple[list[str], Annotated[ndarray[tuple[Any, ...], dtype[float64]], '3D Array'], Annotated[ndarray[tuple[Any, ...], dtype[int64]], '2D Array']][source]

Convert training data to neural network input format.

Args:: classdict: Training data with class labels as keys and texts as values.
Returns:: Tuple of (class_labels, embedded_vectors, labels_array).

train(classdict: dict[str, list[str]], kerasmodel: tensorflow.keras.models.Model, nb_epoch: int = 10)[source]

Train the classifier.

Args:: classdict: Training data. kerasmodel: Keras Sequential model. nb_epoch: Number of training epochs. Default: 10.
Raises:: ModelNotTrainedException: If model not loaded.

savemodel(nameprefix: str) → None[source]

Save the trained model to files.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If not trained.

loadmodel(nameprefix: str) → None[source]

Load a trained model from files.

Args:: nameprefix: Prefix for input files.

word_to_embedvec(word: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Convert a word to its embedding vector.

Args:: word: Input word.
Returns:: Embedding vector. Returns zeros if word not in vocabulary.

shorttext_to_matrix(shorttext: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '2D Array'][source]

Convert short text to embedding matrix.

Args:: shorttext: Input text.
Returns:: Matrix of shape (maxlen, vecsize) with embedding vectors.
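
A minimal sketch of how a (maxlen, vecsize) matrix could be built from a short text. Zero vectors for out-of-vocabulary words follow the word_to_embedvec description above; whitespace tokenization, truncation at maxlen, and zero-padding of unused rows are assumptions of this example, and text_to_matrix is a hypothetical name.

```python
def text_to_matrix(
    text: str, embeddings: dict[str, list[float]], maxlen: int, vecsize: int
) -> list[list[float]]:
    """Build a maxlen x vecsize matrix with one embedding vector per token."""
    # Start from all zeros so short texts are padded automatically.
    matrix = [[0.0] * vecsize for _ in range(maxlen)]
    # Tokens beyond maxlen are dropped (assumed truncation).
    for row, token in enumerate(text.split()[:maxlen]):
        # Unknown tokens keep a zero vector.
        matrix[row] = list(embeddings.get(token, [0.0] * vecsize))
    return matrix


embeddings = {"hello": [0.1, 0.2, 0.3], "world": [0.4, 0.5, 0.6]}
matrix = text_to_matrix("hello unknown world", embeddings, maxlen=4, vecsize=3)
```

The fixed shape is what lets a batch of texts be stacked into the rank-3 array that the neural network consumes.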

score(shorttext: str, model_params: dict[str, Any] | None = None) → dict[str, float][source]

Calculate classification scores for all class labels.

Args:: shorttext: Input text. model_params: Additional parameters for model prediction.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If not trained.

classmethod from_pretrained(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → Self[source]

Load a VarNNEmbeddedVecClassifier from file.

Args:: wvmodel: Word embedding model. name: Model name (compact) or file prefix (non-compact). compact: Whether to load compact model. Default: True. vecsize: Vector size. Default: None.
Returns:: VarNNEmbeddedVecClassifier instance.

shorttext.classifiers.embed.nnlib.VarNNEmbedVecClassification.load_varnnlibvec_classifier(wvmodel: gensim.models.keyedvectors.KeyedVectors, name: str, compact: bool = True, vecsize: int | None = None) → VarNNEmbeddedVecClassifier[source]: Deprecated. Use VarNNEmbeddedVecClassifier.from_pretrained.

shorttext.classifiers.embed.nnlib.frameworks.CNNWordEmbed(nb_labels: int, wvmodel: gensim.models.keyedvectors.KeyedVectors | None = None, nb_filters: int = 1200, n_gram: int = 2, maxlen: int = 15, vecsize: int = 300, cnn_dropout: float = 0.0, final_activation: Literal['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear'] = 'softmax', dense_wl2reg: float = 0.0, dense_bl2reg: float = 0.0, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam') → tensorflow.keras.models.Model[source]

Create a CNN for word embeddings.

Args::
    nb_labels: Number of class labels.
    wvmodel: Word embedding model. If provided, vecsize is extracted from it.
    nb_filters: Number of filters. Default: 1200.
    n_gram: N-gram (window size). Default: 2.
    maxlen: Maximum sentence length. Default: 15.
    vecsize: Embedding vector size. Default: 300.
    cnn_dropout: CNN dropout rate. Default: 0.0.
    final_activation: Final layer activation. Default: softmax.
    dense_wl2reg: L2 regularization for weights. Default: 0.0.
    dense_bl2reg: L2 regularization for bias. Default: 0.0.
    optimizer: Optimizer. Default: adam.
Returns:: Keras Sequential model.
Reference:: Yoon Kim, “Convolutional Neural Networks for Sentence Classification,” EMNLP 2014 (arXiv:1408.5882). https://arxiv.org/abs/1408.5882
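
One way the pieces documented above might fit together is sketched below: build a CNN with CNNWordEmbed, then pass it to VarNNEmbeddedVecClassifier.train. The call signatures follow this reference; count_labels and train_cnn_classifier are hypothetical helpers, and the sketch assumes wvmodel is an already loaded KeyedVectors instance and that the default maxlen of 15 is used on both sides.

```python
def count_labels(classdict: dict[str, list[str]]) -> int:
    """Number of distinct class labels in the training data."""
    return len(classdict)


def train_cnn_classifier(wvmodel, classdict: dict[str, list[str]], nb_epoch: int = 10):
    """Build a CNN with CNNWordEmbed and train a VarNNEmbeddedVecClassifier."""
    # Imported lazily so count_labels stays usable without shorttext installed.
    from shorttext.classifiers.embed.nnlib.frameworks import CNNWordEmbed
    from shorttext.classifiers.embed.nnlib.VarNNEmbedVecClassification import (
        VarNNEmbeddedVecClassifier,
    )

    # The output dimension of the network must match the number of labels.
    kerasmodel = CNNWordEmbed(count_labels(classdict), wvmodel=wvmodel)
    classifier = VarNNEmbeddedVecClassifier(wvmodel)
    classifier.train(classdict, kerasmodel, nb_epoch=nb_epoch)
    return classifier
```

After training, classifier.score(text) would return the per-label scores described earlier in this section.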

shorttext.classifiers.embed.nnlib.frameworks.DoubleCNNWordEmbed(nb_labels: int, wvmodel: gensim.models.keyedvectors.KeyedVectors | None = None, nb_filters_1: int = 1200, nb_filters_2: int = 600, n_gram: int = 2, filter_length_2: int = 10, maxlen: int = 15, vecsize: int = 300, cnn_dropout_1: float = 0.0, cnn_dropout_2: float = 0.0, final_activation: Literal['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear'] = 'softmax', dense_wl2reg: float = 0.0, dense_bl2reg: float = 0.0, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam') → tensorflow.keras.models.Model[source]

Create a double-layer CNN for word embeddings.

Args::
    nb_labels: Number of class labels.
    wvmodel: Word embedding model. If provided, vecsize is extracted from it.
    nb_filters_1: Filters for first layer. Default: 1200.
    nb_filters_2: Filters for second layer. Default: 600.
    n_gram: N-gram for first layer. Default: 2.
    filter_length_2: Window size for second layer. Default: 10.
    maxlen: Maximum sentence length. Default: 15.
    vecsize: Embedding vector size. Default: 300.
    cnn_dropout_1: Dropout for first layer. Default: 0.0.
    cnn_dropout_2: Dropout for second layer. Default: 0.0.
    final_activation: Final layer activation. Default: softmax.
    dense_wl2reg: L2 regularization for weights. Default: 0.0.
    dense_bl2reg: L2 regularization for bias. Default: 0.0.
    optimizer: Optimizer. Default: adam.
Returns:: Keras Sequential model.

shorttext.classifiers.embed.nnlib.frameworks.CLSTMWordEmbed(nb_labels: int, wvmodel: gensim.models.keyedvectors.KeyedVectors | None = None, nb_filters: int = 1200, n_gram: int = 2, maxlen: int = 15, vecsize: int = 300, cnn_dropout: float = 0.0, nb_rnnoutdim: int = 1200, rnn_dropout: int = 0.2, final_activation: Literal['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear'] = 'softmax', dense_wl2reg: float = 0.0, dense_bl2reg: float = 0.0, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam') → tensorflow.keras.models.Model[source]

Create a C-LSTM model for word embeddings.

Args::
    nb_labels: Number of class labels.
    wvmodel: Word embedding model. If provided, vecsize is extracted from it.
    nb_filters: Number of CNN filters. Default: 1200.
    n_gram: N-gram (window size). Default: 2.
    maxlen: Maximum sentence length. Default: 15.
    vecsize: Embedding vector size. Default: 300.
    cnn_dropout: CNN dropout rate. Default: 0.0.
    nb_rnnoutdim: LSTM output dimension. Default: 1200.
    rnn_dropout: LSTM dropout rate. Default: 0.2.
    final_activation: Final layer activation. Default: softmax.
    dense_wl2reg: L2 regularization for weights. Default: 0.0.
    dense_bl2reg: L2 regularization for bias. Default: 0.0.
    optimizer: Optimizer. Default: adam.
Returns:: Keras Sequential model.
Reference:: Chunting Zhou et al., “A C-LSTM Neural Network for Text Classification,” arXiv:1511.08630 (2015). https://arxiv.org/abs/1511.08630

Generators

Bag-of-Words Generators

class shorttext.generators.bow.GensimTopicModeling.GensimTopicModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True)[source]

Bases: LatentTopicModeler

Topic modeler using gensim implementations.

Supports LDA (Latent Dirichlet Allocation), LSI (Latent Semantic Indexing), and Random Projections (RP) for topic modeling.

Note:: For compact model I/O, use LDAModeler, LSIModeler, or RPModeler instead.

__init__(preprocessor: callable | None = None, tokenizer: callable | None = None, algorithm: Literal['lda', 'lsi', 'rp'] = 'lda', toweigh: bool = True, normalize: bool = True)[source]

Initialize the topic modeler.

Args:: preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. tokenizer: Tokenization function. Default: tokenize. algorithm: Topic modeling algorithm. Options: ‘lda’, ‘lsi’, ‘rp’. Default: ‘lda’. toweigh: Whether to apply tf-idf weighting. Default: True. normalize: Whether to normalize topic vectors. Default: True.

generate_corpus(classdict: dict[str, list[str]]) → None[source]

Generate gensim dictionary and corpus.

Args:: classdict: Training data.

train(classdict: dict[str, list[str]], nb_topics: int, *args, **kwargs) → None[source]

Train the topic modeler.

Args:: classdict: Training data with class labels as keys and texts as values. nb_topics: Number of latent topics. *args: Arguments for the gensim topic model. **kwargs: Keyword arguments for the gensim topic model.

update(additional_classdict: dict[str, list[str]]) → None[source]

Update model with additional data.

Warning: Does not support adding new class labels or new vocabulary. For comprehensive updates, retrain the model.

Args:: additional_classdict: Additional training data.

retrieve_bow(shorttext: str) → list[tuple[int, int]][source]

Get bag-of-words representation.

Args:: shorttext: Input text.
Returns:: List of (word_id, count) tuples.
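
As a rough illustration of the (word_id, count) format, the sketch below builds a bag-of-words from a plain Python dictionary. The vocabulary `token_to_id` and the helper `to_bow` are hypothetical stand-ins and not part of shorttext; the library works from a gensim dictionary and its configured preprocessor and tokenizer.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer ids (an assumption for illustration).
token_to_id = {"topic": 0, "model": 1, "short": 2, "text": 3}


def to_bow(text: str) -> list[tuple[int, int]]:
    """Return sorted (word_id, count) pairs, skipping out-of-vocabulary tokens."""
    tokens = text.lower().split()  # naive whitespace tokenizer, not the library's
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


bow = to_bow("Short text topic model for short text")
print(bow)
```

Tokens missing from the vocabulary ("for" in this example) contribute nothing to the result.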

retrieve_bow_vector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get bag-of-words vector.

Args:: shorttext: Input text.
Returns:: BOW vector.

retrieve_corpus_topicdist(shorttext: str) → list[tuple[int, int | float]][source]

Get topic distribution (corpus form).

Args:: shorttext: Input text.
Returns:: List of (topic_id, weight) tuples.
Raises:: ModelNotTrainedException: If model not trained.

retrieve_topicvec(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Topic vector.
Raises:: ModelNotTrainedException: If model not trained.

get_batch_cos_similarities(shorttext: str) → dict[str, float][source]

Get cosine similarities to all classes.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
Raises:: ModelNotTrainedException: If model not trained.
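
The returned dictionary can be pictured as one cosine similarity per class label. The sketch below is a pure-Python illustration under the assumption that each class is represented by a single topic vector; `class_vectors`, `query_vector`, and `cosine` are hypothetical names, not library attributes.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical per-class topic vectors and the topic vector of one input text.
class_vectors = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
query_vector = [3.0, 4.0]

similarities = {
    label: cosine(query_vector, vec) for label, vec in class_vectors.items()
}
print(similarities)
```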

loadmodel(nameprefix: str) → None[source]

Load topic model from files.

Args:: nameprefix: Prefix for input files.

savemodel(nameprefix: str) → None[source]

Save topic model to files.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

classmethod from_pretrained(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load a gensim topic model from files.

Args:: name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. tokenizer: Tokenization function. compact: Whether to load a compact model. Default: True.
Returns:: A topic modeler instance.

class shorttext.generators.bow.GensimTopicModeling.LDAModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, toweigh: bool = True, normalize: bool = True)[source]

Bases: GensimTopicModeler, CompactIOMachine

LDA topic modeler with compact I/O support.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

class shorttext.generators.bow.GensimTopicModeling.LSIModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, toweigh: bool = True, normalize: bool = True)[source]

Bases: GensimTopicModeler, CompactIOMachine

LSI topic modeler with compact I/O support.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

class shorttext.generators.bow.GensimTopicModeling.RPModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, toweigh: bool = True, normalize: bool = True)[source]

Bases: GensimTopicModeler, CompactIOMachine

Random Projection topic modeler with compact I/O support.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

shorttext.generators.bow.GensimTopicModeling.load_gensimtopicmodel(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → GensimTopicModeler[source]: Deprecated. Use GensimTopicModeler.from_pretrained.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

class shorttext.generators.bow.LatentTopicModeling.LatentTopicModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True)[source]

Bases: ABC

Abstract base class for topic modelers.

Provides interface for converting short texts to topic vector representations using various topic modeling algorithms.

__init__(preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True)[source]

Initialize the topic modeler.

Args:: preprocessor: Text preprocessing function. Default: standard_text_preprocessor_1. tokenizer: Tokenization function. Default: tokenize. normalize: Whether to normalize output vectors. Default: True.

abstractmethod train(classdict: dict[str, list[str]], nb_topics: int, *args, **kwargs) → None[source]

Train the topic modeler.

Args:: classdict: Training data with class labels as keys and texts as values. nb_topics: Number of latent topics. *args: Additional arguments for the training algorithm. **kwargs: Additional keyword arguments.
Raises:: NotImplementedError: This is an abstract method.

abstractmethod retrieve_bow(shorttext: str) → list[tuple[int, int]][source]

Get bag-of-words representation.

Args:: shorttext: Input text.
Returns:: List of (word_id, count) tuples.
Raises:: NotImplementedError: Abstract method.

abstractmethod retrieve_bow_vector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get bag-of-words vector.

Args:: shorttext: Input text.
Returns:: BOW vector.
Raises:: NotImplementedError: Abstract method.

abstractmethod retrieve_topicvec(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Topic vector.
Raises:: NotImplementedError: Abstract method.

abstractmethod get_batch_cos_similarities(shorttext: str) → dict[str, float][source]

Get cosine similarities to all classes.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
Raises:: NotImplementedError: Abstract method.

__getitem__(shorttext) → ndarray[tuple[Any, ...], dtype[float64]][source]: Get topic vector for text (shortcut for retrieve_topicvec).

__contains__(shorttext)[source]: Check if model is trained.

abstractmethod loadmodel(nameprefix: str)[source]

Load model from files.

Args:: nameprefix: Prefix for input files.
Raises:: NotImplementedError: Abstract method.

abstractmethod savemodel(nameprefix: str)[source]

Save model to files.

Args:: nameprefix: Prefix for output files.
Raises:: NotImplementedError: Abstract method.

abstractmethod get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

shorttext.generators.bow.AutoEncodingTopicModeling.get_autoencoder_models(vector_size: int, nb_latent_vector_size: int) → AutoEncoderPackage[source]

Create autoencoder model components.

Args:: vector_size: Size of input vectors. nb_latent_vector_size: Size of the latent space (number of topics).
Returns:: AutoEncoderPackage containing autoencoder, encoder, and decoder models.

class shorttext.generators.bow.AutoEncodingTopicModeling.AutoencodingTopicModeler(preprocessor: callable | None = None, tokenizer: callable | None = None, normalize: bool = True)[source]

Bases: LatentTopicModeler, CompactIOMachine

Topic modeler using autoencoder.

Uses a Keras autoencoder to learn latent topic representations. The encoded vectors serve as topic vectors for short text classification.

Reference:: Francois Chollet, “Building Autoencoders in Keras,” https://blog.keras.io/building-autoencoders-in-keras.html

train(classdict: dict[str, list[str]], nb_topics: int, *args, **kwargs) → None[source]

Train the autoencoder topic model.

Args:: classdict: Training data with class labels as keys and texts as values. nb_topics: Number of latent topics (encoding dimensions). *args: Arguments for Keras model fitting. **kwargs: Keyword arguments for Keras model fitting.

retrieve_bow(shorttext: str) → list[tuple[int, int]][source]

Get bag-of-words representation.

Args:: shorttext: Input text.
Returns:: List of (token_index, count) tuples.

retrieve_bow_vector(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get bag-of-words vector.

Args:: shorttext: Input text.
Returns:: BOW vector (normalized if normalize=True).

retrieve_topicvec(shorttext: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Get topic vector for short text.

Args:: shorttext: Input text.
Returns:: Encoded vector representation.
Raises:: ModelNotTrainedException: If model not trained.

precalculate_liststr_topicvec(shorttexts: list[str]) → ndarray[tuple[Any, ...], dtype[float64]][source]

Calculate average topic vector for a list of texts.

Used during training to compute class centroids.

Args:: shorttexts: List of texts.
Returns:: Average topic vector (normalized).
Raises:: ModelNotTrainedException: If model not trained.
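
A minimal sketch of the centroid idea: average the per-text vectors, then rescale the mean. It assumes L2 normalization and uses plain lists; `mean_unit_vector` is a hypothetical helper, and the library's own computation may differ in detail.

```python
import math


def mean_unit_vector(vectors: list[list[float]]) -> list[float]:
    """Average the vectors component-wise, then scale the mean to unit L2 norm."""
    n = len(vectors)
    mean = [sum(component) / n for component in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0.0 else mean


# Hypothetical topic vectors for three texts that share a class label.
centroid = mean_unit_vector([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(centroid)
```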

get_batch_cos_similarities(shorttext: str) → dict[str, float][source]

Get cosine similarities to all class centroids.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to similarity scores.
Raises:: ModelNotTrainedException: If model not trained.

savemodel(nameprefix: str, save_complete_autoencoder: bool = True) → None[source]

Save the autoencoder model to files.

Saves encoder, optional decoder, and autoencoder weights along with configuration parameters.

Args:: nameprefix: Prefix for output files. save_complete_autoencoder: Whether to save decoder and complete autoencoder. Default: True.
Raises:: ModelNotTrainedException: If model not trained.

loadmodel(nameprefix: str, load_incomplete: bool = False) → None[source]

Load the autoencoder model from files.

Args:: nameprefix: Prefix for input files. load_incomplete: If True, only load encoder (for models from v0.2.1). Default: False.
Raises:: ModelNotTrainedException: If loading fails.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with model information.

classmethod from_pretrained(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → Self[source]

Load an autoencoder topic model from files.

Args:: name: Model name (compact) or file prefix (non-compact). preprocessor: Text preprocessing function. tokenizer: Tokenization function. compact: Whether to load a compact model. Default: True.
Returns:: An AutoencodingTopicModeler instance.

shorttext.generators.bow.AutoEncodingTopicModeling.load_autoencoder_topicmodel(name: str, preprocessor: callable | None = None, tokenizer: callable | None = None, compact: bool = True) → AutoencodingTopicModeler[source]: Deprecated. Use AutoencodingTopicModeler.from_pretrained.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

Sequence-to-Sequence Generators

class shorttext.generators.seq2seq.s2skeras.Seq2SeqWithKeras(vecsize: int, latent_dim: int)[source]

Bases: CompactIOMachine

Sequence-to-sequence (seq2seq) model using Keras.

Implements encoder-decoder architecture for sequence generation tasks.

Reference:

Ilya Sutskever, James Martens, Geoffrey Hinton, “Generating Text with Recurrent Neural Networks,” ICML (2011). https://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf

Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). https://arxiv.org/abs/1409.3215

Francois Chollet, “A ten-minute introduction to sequence-to-sequence learning in Keras,” The Keras Blog. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017).

__init__(vecsize: int, latent_dim: int)[source]

Initialize the model.

Args:: vecsize: Vector size of the sequence. latent_dim: Latent dimension in the RNN cell.

prepare_model() → None[source]: Prepare the Keras model.

compile(optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') → None[source]

Compile the Keras model.

Args:: optimizer: Optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. Default: rmsprop. loss: Loss function from tensorflow.keras. Default: ‘categorical_crossentropy’.

fit(encoder_input: ndarray[tuple[Any, ...], dtype[float64]], decoder_input: ndarray[tuple[Any, ...], dtype[float64]], decoder_output: ndarray[tuple[Any, ...], dtype[float64]], batch_size: int = 64, epochs: int = 100) → None[source]

Fit the seq2seq model.

Args:: encoder_input: Encoder input, a rank-3 tensor. decoder_input: Decoder input, a rank-3 tensor. decoder_output: Decoder output, a rank-3 tensor. batch_size: Batch size. Default: 64. epochs: Number of epochs. Default: 100.

savemodel(prefix: str, final: bool = False) → None[source]

Save the trained model to files.

For compact save, use save_compact_model instead.

Args:: prefix: Prefix of the file path. final: Whether the model is final (cannot be further trained). Default: False.
Raises:: ModelNotTrainedException: If no trained model exists.

loadmodel(prefix: str) → None[source]

Load a trained model from files.

For compact load, use load_compact_model instead.

Args:: prefix: Prefix of the file path.

classmethod from_pretrained(path: str | PathLike, compact: bool = True) → Self[source]

Load a trained Seq2SeqWithKeras model from file.

Args:: path: Path of the model file. compact: Whether to load a compact model. Default: True.
Returns:: Seq2SeqWithKeras instance for sequence-to-sequence inference.

shorttext.generators.seq2seq.s2skeras.load_seq2seq_model(path: str | PathLike, compact: bool = True) → Seq2SeqWithKeras[source]

Load a trained Seq2SeqWithKeras model from file.

Args:: path: Path of the model file. compact: Whether to load a compact model. Default: True.
Returns:: Seq2SeqWithKeras instance for sequence-to-sequence inference.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

shorttext.generators.seq2seq.s2skeras.loadSeq2SeqWithKeras(path: str | PathLike, compact: bool = True) → Seq2SeqWithKeras[source]: Deprecated. Call load_seq2seq_model instead.

Deprecated since version 4.0.0: This will be removed in 4.1.0.

class shorttext.generators.seq2seq.charbaseS2S.CharBasedSeq2SeqGenerator(sent2charvec_encoder: SentenceToCharVecEncoder, latent_dim: int, maxlen: int)[source]

Bases: CompactIOMachine

Character-based sequence-to-sequence model.

Implements seq2seq at the character level. Uses Seq2SeqWithKeras internally.

Reference:: Oriol Vinyals, Quoc Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). https://arxiv.org/abs/1506.05869

__init__(sent2charvec_encoder: SentenceToCharVecEncoder, latent_dim: int, maxlen: int)[source]

Initialize the generator.

Args:: sent2charvec_encoder: Character encoder. latent_dim: Number of latent dimensions. maxlen: Maximum length of a sentence.

compile(optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') → None[source]

Compile the Keras model.

Args:: optimizer: Optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. Default: rmsprop. loss: Loss function from tensorflow.keras. Default: ‘categorical_crossentropy’.

prepare_trainingdata(txtseq: str) → tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]

Transform text to numerical vector format.

Args:: txtseq: Input text.
Returns:: Tuple of (encoder_input, decoder_input, decoder_output) as rank-3 tensors.

train(txtseq: str, batch_size: int = 64, epochs: int = 100, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') → None[source]

Train the character-based seq2seq model.

Args:: txtseq: Training text. batch_size: Batch size. Default: 64. epochs: Number of epochs. Default: 100. optimizer: Optimizer for gradient descent. Default: rmsprop. loss: Loss function from tensorflow.keras. Default: ‘categorical_crossentropy’.

decode(txtseq: str, stochastic: bool = True) → str[source]

Generate output text from input text.

Args:: txtseq: Input text. stochastic: Whether to use stochastic sampling. Default: True.
Returns:: Generated output text.

savemodel(prefix: str, final: bool = False) → None[source]

Save the trained model to files.

For compact save, use save_compact_model instead.

Args:: prefix: Prefix of the file path. final: Whether the model is final (cannot be further trained). Default: False.
Raises:: ModelNotTrainedException: If no trained model exists.

loadmodel(prefix: str) → None[source]

Load a trained model from files.

For compact load, use load_compact_model instead.

Args:: prefix: Prefix of the file path.

classmethod from_pretrained(path: str | PathLike, compact: bool = True) → Self[source]

Load a trained CharBasedSeq2SeqGenerator from file.

Args:: path: Path of the model file. compact: Whether to load a compact model. Default: True.
Returns:: CharBasedSeq2SeqGenerator instance for seq2seq inference.

shorttext.generators.seq2seq.charbaseS2S.loadCharBasedSeq2SeqGenerator(path: str | PathLike, compact: bool = True) → CharBasedSeq2SeqGenerator[source]: Deprecated. Use CharBasedSeq2SeqGenerator.from_pretrained.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

Character-Based Generators

class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')[source]

Bases: object

One-hot encoder for character-level text representations.

Converts sentences into one-hot encoded vectors at the character level. Useful for character-level sequence models.

Reference:: General architecture inspired by char-RNN and related models.

__init__(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')[source]

Initialize the character vector encoder.

Args:: dictionary: Gensim Dictionary mapping characters to indices. signalchar: Signal character for sequence markers. Default: ‘\n’ (newline).

calculate_prelim_vec(sent: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Convert sentence to one-hot character vectors.

Args:: sent: Input sentence.
Returns:: One-hot encoded sparse matrix where each row represents a character’s encoding.

encode_sentence(sent: str, maxlen: int, startsig: bool = False, endsig=False) → csc_matrix[source]

Encode a sentence to a sparse character vector matrix.

Args:: sent: Input sentence to encode. maxlen: Maximum length of the encoded sequence. startsig: Whether to prepend signal character. Default: False. endsig: Whether to append signal character. Default: False.
Returns:: Sparse matrix representing the sentence with shape (maxlen + startsig + endsig, num_chars).

encode_sentences(sentences: list[str], maxlen: int, sparse: bool = True, startsig: bool = False, endsig: bool = False) → list[ndarray[tuple[Any, ...], dtype[float64]]] | ndarray[tuple[Any, ...], dtype[float64]][source]

Encode multiple sentences into character vectors.

Args:: sentences: List of sentences to encode. maxlen: Maximum length for each encoded sentence. sparse: Whether to return sparse matrices. Default: True. startsig: Whether to prepend signal character. Default: False. endsig: Whether to append signal character. Default: False.
Returns:: If sparse=True: list of sparse matrices. If sparse=False: numpy array of shape (n_sentences, maxlen, num_chars).
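
To make the output shape concrete, here is a pure-Python sketch of character-level one-hot encoding that returns nested lists instead of sparse matrices. The helper `one_hot_encode` and the mapping `char_to_id` are hypothetical; placing the end signal directly after the text and leaving unknown characters as all-zero rows are assumptions of this sketch, not documented behavior.

```python
def one_hot_encode(
    sent: str,
    char_to_id: dict[str, int],
    maxlen: int,
    startsig: bool = False,
    endsig: bool = False,
    signalchar: str = "\n",
) -> list[list[int]]:
    """Return a (maxlen + startsig + endsig) x num_chars one-hot matrix."""
    chars = list(sent[:maxlen])  # truncate to at most maxlen characters
    if startsig:
        chars.insert(0, signalchar)
    if endsig:
        chars.append(signalchar)
    nb_rows = maxlen + int(startsig) + int(endsig)
    matrix = [[0] * len(char_to_id) for _ in range(nb_rows)]
    for row, ch in enumerate(chars):
        if ch in char_to_id:  # unknown characters stay as all-zero rows
            matrix[row][char_to_id[ch]] = 1
    return matrix


char_to_id = {"\n": 0, "a": 1, "b": 2, "c": 3}
encoded = one_hot_encode("cab", char_to_id, maxlen=4, startsig=True)
for row in encoded:
    print(row)
```

With maxlen=4 and startsig=True the matrix has five rows; the last row is padding because the input has only three characters.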

__len__() → int[source]: Return the number of unique characters in the dictionary.

classmethod from_pretrained(textfile: str | PathLike, encoding: bool | None = None) → Self[source]

Create a SentenceToCharVecEncoder from a text file.

Builds a character dictionary from the given text file and returns an encoder instance.

Args:: textfile: Path to the text file for building the character dictionary. encoding: Encoding of the text file. Default: None.
Returns:: A SentenceToCharVecEncoder instance.

shorttext.generators.charbase.char2vec.initialize_SentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) → SentenceToCharVecEncoder[source]: Deprecated. Use SentenceToCharVecEncoder.from_pretrained.

shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) → SentenceToCharVecEncoder[source]: Deprecated. Use initialize_SentenceToCharVecEncoder instead.

Deprecated since version 4.0.0: This will be removed in 4.1.0.

Metrics

shorttext.metrics.dynprog.jaccard.similarity(word1: str, word2: str) → float[source]

Calculate similarity between two words.

Computes similarity as the maximum of:

- 1 - (Damerau-Levenshtein distance / max length)
- longest common prefix length / max length

Args:: word1: First word. word2: Second word.
Returns:: Similarity score between 0 and 1.
Reference:: Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” IEEE CBMS 2014, pp. 347-350. http://ieeexplore.ieee.org/abstract/document/6881904/
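
The formula can be sketched in pure Python as follows. This is an independent illustration, not the library's implementation: `osa_distance` uses the optimal string alignment variant of Damerau-Levenshtein, and returning 1.0 for two empty strings is an assumption of this sketch.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters shared by both strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def similarity_sketch(word1: str, word2: str) -> float:
    """Maximum of the edit-distance similarity and the prefix similarity."""
    maxlen = max(len(word1), len(word2))
    if maxlen == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    edit_sim = 1.0 - osa_distance(word1, word2) / maxlen
    prefix_sim = common_prefix_length(word1, word2) / maxlen
    return max(edit_sim, prefix_sim)


print(similarity_sketch("debug", "deubg"))
```

For "debug" and "deubg" the transposition costs one edit over a maximum length of five, so the edit-distance term is 0.8 while the prefix term is 0.4.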

shorttext.metrics.dynprog.jaccard.soft_intersection_list(tokens1: list[str], tokens2: list[str]) → set[str][source]

Compute soft intersection between two token lists.

Finds the best matching pairs between tokens using similarity, where each token can only match once.

Args:: tokens1: First list of tokens. tokens2: Second list of tokens.
Returns:: Set of ((token1, token2), similarity) tuples representing matches.

shorttext.metrics.dynprog.jaccard.soft_jaccard_score(tokens1: str, tokens2: str) → float[source]

Compute soft Jaccard score between token lists.

Uses fuzzy matching based on edit distance and longest common prefix.

Args:: tokens1: First list of tokens. tokens2: Second list of tokens.
Returns:: Soft Jaccard score between 0 and 1.
Reference:: Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupation Codes for Epidemiological Analyses,” IEEE CBMS 2014, pp. 347-350. http://ieeexplore.ieee.org/abstract/document/6881904/
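
One way to picture the soft Jaccard score: pair tokens greedily by similarity so that each token is matched at most once, sum the similarities of the matched pairs as a "soft" intersection, and divide by the corresponding union. The sketch below is an assumption-laden illustration, not the library's code: `difflib.SequenceMatcher` stands in for the word similarity described above, and the union formula len(tokens1) + len(tokens2) - intersection is assumed.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Callable


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in word similarity in [0, 1]; not the measure used by the library."""
    return SequenceMatcher(None, a, b).ratio()


def soft_matches(
    tokens1: list[str],
    tokens2: list[str],
    sim: Callable[[str, str], float] = stand_in_similarity,
) -> list[tuple[tuple[str, str], float]]:
    """Greedily pair tokens by descending similarity; each token is used once."""
    scored = sorted(
        (
            (sim(t1, t2), i, j)
            for (i, t1), (j, t2) in product(enumerate(tokens1), enumerate(tokens2))
        ),
        reverse=True,
    )
    used1: set[int] = set()
    used2: set[int] = set()
    matches = []
    for score, i, j in scored:
        if score > 0.0 and i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            matches.append(((tokens1[i], tokens2[j]), score))
    return matches


def soft_jaccard_sketch(
    tokens1: list[str],
    tokens2: list[str],
    sim: Callable[[str, str], float] = stand_in_similarity,
) -> float:
    """Soft intersection divided by soft union (assumed formula)."""
    intersection = sum(score for _, score in soft_matches(tokens1, tokens2, sim))
    union = len(tokens1) + len(tokens2) - intersection
    return intersection / union if union > 0 else 0.0


print(soft_jaccard_sketch(["diesel", "engine"], ["deisel", "engines"]))
```

Passing an exact-match function as `sim` reduces the sketch to the ordinary Jaccard index on token lists without repeated tokens.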

shorttext.metrics.dynprog.dldist.damerau_levenshtein(word1: str, word2: str) → int

Calculate the Damerau-Levenshtein distance between two words.

Computes the edit distance considering adjacent transpositions (swapping two adjacent characters counts as one edit).

Args:: word1: First word. word2: Second word.
Returns:: The Damerau-Levenshtein distance between the two words.
Reference:: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

shorttext.metrics.dynprog.lcp.longest_common_prefix(word1: str, word2: str) → int

Calculate the longest common prefix length of two strings.

Args:: word1: First string. word2: Second string.
Returns:: Length of the longest common prefix.

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance_linprog(first_sent_tokens: list[str], second_sent_tokens: list[str], wvmodel: gensim.models.keyedvectors.KeyedVectors, distancefunc: callable | None = None) → OptimizeResult[source]

Compute Word Mover’s distance via linear programming.

Uses scipy.optimize.linprog to compute the transport problem for the Word Mover’s Distance.

Args:: first_sent_tokens: First list of tokens. second_sent_tokens: Second list of tokens. wvmodel: Word embedding model. distancefunc: Distance function for word vectors. Default: Euclidean.
Returns:: scipy.optimize.OptimizeResult containing the optimization result.
Reference:: Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML 2015.
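
For orientation, the transport problem from the cited paper can be written as the linear program below. The symbols are notation for this note rather than names from the library: $T_{ij}$ is the flow from word $i$ of the first sentence to word $j$ of the second, $c(i, j)$ is the distance between their embedding vectors (Euclidean by default, per distancefunc), and $d_i$, $d'_j$ are the normalized word frequencies of the two sentences.

```latex
\min_{T \ge 0} \; \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} \, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; (i = 1, \dots, m),
\qquad
\sum_{i=1}^{m} T_{ij} = d'_j \;\; (j = 1, \dots, n)
```

The optimal objective value is the Word Mover's distance; the flow matrix $T$ itself is part of the optimization result.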

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance(first_sent_tokens: list[str], second_sent_tokens: list[str], wvmodel: gensim.models.keyedvectors.KeyedVectors, distancefunc: callable | None = None) → float[source]

Compute Word Mover’s distance between token lists.

Uses word embeddings to compute the minimum transport cost between words in two sentences.

Args:: first_sent_tokens: First list of tokens. second_sent_tokens: Second list of tokens. wvmodel: Word embedding model. distancefunc: Distance function for word vectors. Default: Euclidean.
Returns:: The Word Mover’s distance (lower is more similar).
Reference:: Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML 2015.

shorttext.metrics.embedfuzzy.jaccard.jaccardscore_sents(sent1: str, sent2: str, wvmodel: gensim.models.keyedvectors.KeyedVectors, sim_words: callable | None = None) → float[source]

Compute Jaccard score between sentences using embeddings.

Uses word embeddings to compute a fuzzy Jaccard score where word similarity is measured via embedding cosine similarity.

Args:: sent1: First sentence. sent2: Second sentence. wvmodel: Word embedding model. sim_words: Similarity function for word vectors. Default: cosine.
Returns:: Fuzzy Jaccard score between 0 and 1.

Spell Correction

class shorttext.spell.basespellcorrector.SpellCorrector[source]

Bases: ABC

Abstract base class for spell correctors.

Defines the interface for spelling correction algorithms.

abstractmethod train(text: str) → None[source]

Train the spell corrector on a corpus.

Args:: text: Training text corpus.

abstractmethod correct(word: str) → str[source]

Recommend a spelling correction for a word.

Args:: word: Word to correct.
Returns:: The corrected word.

class shorttext.spell.norvig.NorvigSpellCorrector[source]

Bases: SpellCorrector

Spell corrector based on Peter Norvig’s algorithm.

Uses word frequency counts to suggest corrections for misspelled words by finding edits that exist in the vocabulary.

Reference:: https://norvig.com/spell-correct.html

__init__()[source]: Initialize the spell corrector.

train(text: str) → None[source]

Train on a text corpus.

Builds a word frequency dictionary from the input text.

Args:: text: Training text corpus.

P(word: str) → float[source]

Compute word probability from the training corpus.

Args:: word: Word to get probability for.
Returns:: Probability of the word appearing in the corpus.
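
The idea behind train and P can be sketched with a `Counter`, following the approach in Norvig's essay: count lowercase word tokens, then divide a word's count by the total. The function names below are hypothetical, and the regular-expression tokenizer is an assumption about preprocessing.

```python
import re
from collections import Counter


def train_counts(text: str) -> Counter:
    """Count lowercase word tokens in the training text."""
    return Counter(re.findall(r"\w+", text.lower()))


def word_probability(word: str, counts: Counter) -> float:
    """Relative frequency of the word; 0.0 for unseen words or an empty corpus."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


counts = train_counts("The cat sat on the mat")
print(word_probability("the", counts))
print(word_probability("dog", counts))
```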

correct(word: str) → str[source]

Recommend spelling correction for a word.

Args:: word: Word to correct.
Returns:: Most likely correction, or the original word if no better option.

known(words: list[str]) → set[str][source]

Filter words found in the training vocabulary.

Args:: words: List of words to check.
Returns:: Subset of words that appear in the training corpus.

candidates(word: str) → Generator[str, None, None][source]

Generate spelling correction candidates.

Checks exact match, then edits of distance 1 and 2.

Args:: word: Word to find candidates for.
Yields:: Viable correction candidates.

shorttext.spell.editor.compute_set_edits1(word: str) → set[str]

Generate all single-edit distance words.

Creates all possible words that are one edit (insert, delete, transpose, replace) away from the input word.

Args:: word: Input word.
Returns:: Set of all possible single-edit variations.

shorttext.spell.editor.compute_set_edits2(word: str) → Generator[str, None, None]

Generate all double-edit distance words.

Creates all possible words that are two edits away from the input word by applying compute_set_edits1 to each result.

Args:: word: Input word.
Yields:: All possible double-edit variations.
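
The two edit generators can be sketched in pure Python along the lines of Norvig's essay. This is an illustration rather than the library's code; restricting the alphabet to lowercase ASCII letters is an assumption.

```python
import string
from collections.abc import Iterator


def edits1(word: str, letters: str = string.ascii_lowercase) -> set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word: str, letters: str = string.ascii_lowercase) -> Iterator[str]:
    """Lazily yield strings two edits away by applying edits1 to each result."""
    for first_edit in edits1(word, letters):
        yield from edits1(first_edit, letters)


print("cat" in edits1("cta"))        # reachable by one transposition
print("cat" in set(edits2("ctaa")))  # reachable by a delete plus a transposition
```

Because the second function is a generator, duplicates are not removed; wrap the result in `set()` when uniqueness matters.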

Stacking

class shorttext.stack.stacking.StackedGeneralization(intermediate_classifiers: dict[str, AbstractScorer] | None = None)[source]

Bases: ABC

Abstract base class for stacked generalization.

An intermediate model that takes output from other classifiers as input features and performs another level of classification.

The classifiers must have the score() method that takes a string as input.

Reference:

David H. Wolpert, “Stacked Generalization,” Neural Netw 5: 241-259 (1992).

M. Paz Sesmero et al., “Generating ensembles of heterogeneous classifiers using Stacked Generalization,” WIREs Data Mining and Knowledge Discovery 5: 21-34 (2015).

__init__(intermediate_classifiers: dict[str, AbstractScorer] | None = None)[source]

Initialize the stacking class.

Args:: intermediate_classifiers: Dictionary mapping names to classifier instances.

register_classifiers() → None[source]

Register the intermediate classifiers.

Must be called before training.

register_classlabels(labels: list[str]) → None[source]

Register output labels.

Args:: labels: List of output class labels.

Must be called before training.

add_classifier(name: str, classifier: AbstractScorer) → None[source]

Add a classifier to the stack.

Args:: name: Name for the classifier (no spaces or special characters). classifier: Classifier instance with a score() method.

delete_classifier(name: str) → None[source]

Delete a classifier from the stack.

Args:: name: Name of the classifier to delete.
Raises:: KeyError: If classifier name not found.

translate_shorttext_intfeature_matrix(shorttext: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '2D Array'][source]

Convert short text to feature matrix for stacking.

Args:: shorttext: Input text.
Returns:: Feature matrix of shape (n_classifiers, n_labels).
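
The (n_classifiers, n_labels) shape can be illustrated with stand-in scorers. `ConstantScorer` and `feature_matrix` are hypothetical names for this sketch; the only contract taken from the documentation is that each classifier exposes score(text) and returns a dictionary of label scores. Filling a missing label with 0.0 is an assumption.

```python
class ConstantScorer:
    """Hypothetical classifier stand-in that always returns the same scores."""

    def __init__(self, scores: dict[str, float]) -> None:
        self._scores = scores

    def score(self, shorttext: str) -> dict[str, float]:
        return dict(self._scores)


def feature_matrix(
    shorttext: str, classifiers: dict[str, ConstantScorer], labels: list[str]
) -> list[list[float]]:
    """One row per classifier, one column per label; missing labels become 0.0."""
    rows = []
    for classifier in classifiers.values():
        scores = classifier.score(shorttext)  # score each text once per classifier
        rows.append([scores.get(label, 0.0) for label in labels])
    return rows


classifiers = {
    "clf_a": ConstantScorer({"x": 0.9, "y": 0.1}),
    "clf_b": ConstantScorer({"x": 0.4, "y": 0.6}),
}
matrix = feature_matrix("some short text", classifiers, ["x", "y", "z"])
print(matrix)
```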

convert_label_to_buckets(label: str) → Annotated[ndarray[tuple[Any, ...], dtype[int64]], '1D Array'][source]

Convert label to one-hot bucket representation.

Args:: label: Class label.
Returns:: One-hot array with 1 at the label’s position.
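
A minimal sketch of the bucket representation, assuming the position is given by the order of the registered labels; `label_to_buckets` is a hypothetical helper that returns a plain list rather than a NumPy array.

```python
def label_to_buckets(label: str, labels: list[str]) -> list[int]:
    """Return a one-hot list with 1 at the index of label within labels."""
    buckets = [0] * len(labels)
    buckets[labels.index(label)] = 1  # raises ValueError for an unregistered label
    return buckets


print(label_to_buckets("physics", ["math", "physics", "theology"]))
```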

convert_traindata_matrix(classdict: dict[str, list[str]], tobucket: bool = True) → Generator[tuple[Annotated[ndarray[tuple[Any, ...], dtype[float64]], '2D Array'], Annotated[ndarray[tuple[Any, ...], dtype[int64]], '1D Array']], None, None][source]

Yield training data matrices.

Args:: classdict: Training data dictionary. tobucket: Whether to convert labels to buckets. Default: True.
Yields:: Tuples of (feature_matrix, label_array).

abstractmethod train(classdict: dict[str, list[str]], *args, **kwargs) → None[source]

Train the stacked generalization model.

Args:: classdict: Training data. *args: Additional arguments. **kwargs: Additional keyword arguments.
Raises:: NotImplementedError: Abstract method.

abstractmethod score(shorttext: str, *args, **kwargs) → dict[str, float][source]

Calculate classification scores for all labels.

Args:: shorttext: Input text. *args: Additional arguments. **kwargs: Additional keyword arguments.
Returns:: Dictionary mapping class labels to scores.
Raises:: NotImplementedError: Abstract method.

class shorttext.stack.stacking.LogisticStackedGeneralization(intermediate_classifiers: dict[str, AbstractScorer] | None = None)[source]

Bases: StackedGeneralization, CompactIOMachine

Stacked generalization using logistic regression.

Uses a neural network with a sigmoid output to combine predictions from intermediate classifiers.

Note:: Saves the stacked model but not the intermediate classifiers.

train(classdict: dict[str, list[str]], optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'adam', l2reg: float = 0.01, bias_l2reg: float = 0.01, nb_epoch: int = 1000) → None[source]

Train the stacked generalization model.

Args:: classdict: Training data. optimizer: Optimizer for training. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. Default: adam. l2reg: L2 regularization coefficient. Default: 0.01. bias_l2reg: L2 regularization for bias. Default: 0.01. nb_epoch: Number of training epochs. Default: 1000.

score(shorttext: str) → dict[str, float][source]

Calculate classification scores for all labels.

Args:: shorttext: Input text.
Returns:: Dictionary mapping class labels to scores.
Raises:: ModelNotTrainedException: If model not trained.

savemodel(nameprefix: str) → None[source]

Save the stacked model to files.

Note: Intermediate classifiers are not saved. Save them separately.

Args:: nameprefix: Prefix for output files.
Raises:: ModelNotTrainedException: If model not trained.

loadmodel(nameprefix: str) → None[source]

Load the stacked model from files.

Note: Intermediate classifiers are not loaded. Load them separately.

Args:: nameprefix: Prefix for input files.

Data

shorttext.data.data_retrieval.retrieve_csvdata_as_dict(filepath: str | PathLike) → dict[str, list[str]][source]

Retrieve the training data in a CSV file.

Reads a CSV file where the first column contains class labels and the second column contains text data. Returns a dictionary mapping class labels to lists of short texts.

Args:: filepath: Path to the CSV training data file.
Returns:: A dictionary with class labels as keys and lists of short texts as values.
Reference:: Data format inspired by common text classification benchmarks.
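
The CSV layout described above (label in the first column, text in the second) can be illustrated with the standard library alone. This is a sketch, not the library's code: the function name `csv_to_classdict` and the `has_header` flag are hypothetical, and whether the real function expects a header row is not stated in this reference.

```python
import csv
import io
from collections import defaultdict


def csv_to_classdict(handle, has_header: bool = True) -> dict[str, list[str]]:
    """Group second-column texts by first-column label."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row (an assumption of this sketch)
    classdict = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        classdict[row[0]].append(row[1])
    return dict(classdict)


sample = io.StringIO(
    "subject,content\n"
    "math,linear algebra\n"
    "physics,quantum mechanics\n"
    "math,topology\n"
)
classdict = csv_to_classdict(sample)
print(classdict)
```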

shorttext.data.data_retrieval.retrieve_jsondata_as_dict(filepath: str | PathLike) → dict[source]

Retrieve the training data in a JSON file.

Reads a JSON file where class labels are keys and lists of short texts are values. Returns the corresponding dictionary.

Args:: filepath: Path to the JSON training data file.
Returns:: A dictionary with class labels as keys and lists of short texts as values.

shorttext.data.data_retrieval.get_or_download_data(filename: str, origin: str, asbytes: bool = False) → TextIOWrapper[source]

Retrieve or download a data file.

Checks if the file exists in the user’s home directory under .shorttext. If not present, downloads from the given origin URL.

Args:: filename: Name of the file to retrieve. origin: URL to download the file from if not present locally. asbytes: If True, opens the file in binary mode. Default is False.
Returns:: A file object (text or binary mode depending on asbytes).

shorttext.data.data_retrieval.subjectkeywords() → dict[str, list[str]][source]

Return an example dataset of subjects with keywords.

Returns a small example dataset with three subjects and their corresponding keywords, in the training input format.

Returns:: A dictionary with subject labels as keys and lists of keywords as values.

shorttext.data.data_retrieval.inaugural() → dict[str, list[str]][source]

Return the Inaugural Addresses of US Presidents.

Returns an example dataset containing the Inaugural Addresses of all Presidents of the United States from George Washington to Barack Obama.

Each key is formatted as “year-lastname” and the value is a list of sentences from the address.

Returns:: A dictionary with president identifiers as keys and lists of sentences as values.
Reference:: https://www.presidency.us/kisa_exec/inaugural.html

shorttext.data.data_retrieval.nihreports(txt_col='PROJECT_TITLE', label_col='FUNDING_ICs', sample_size=512)[source]

Return an example dataset sampled from NIH RePORT.

Returns an example dataset from NIH (National Institutes of Health) RePORT (Research Portfolio Online Reporting Tools) website.

Args:: txt_col: Column for text data. Options: ‘PROJECT_TITLE’ or ‘ABSTRACT_TEXT’. Default: ‘PROJECT_TITLE’. label_col: Column for labels. Options: ‘FUNDING_ICs’ or ‘IC_NAME’. Default: ‘FUNDING_ICs’. sample_size: Number of samples to return. Set to None for all rows. Default: 512.
Returns:: A dictionary with IC identifiers as keys and lists of text data as values.
Reference:: https://exporter.nih.gov/ExPORTER_Catalog.aspx Dataset adapted from the R package textmineR: https://cran.r-project.org/web/packages/textmineR/index.html

shorttext.data.data_retrieval.merge_cv_dicts(dicts: list[dict[str, list[str]]]) → dict[str, list[str]][source]

Merge multiple training data dictionaries.

Combines multiple data dictionaries in the training data format into a single dictionary.

Args:: dicts: List of dictionaries to merge, each with class labels as keys and lists of texts as values.
Returns:: A merged dictionary with all class labels and texts combined.
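
A minimal sketch of such a merge, assuming that texts for a label shared by several dictionaries are simply concatenated in input order (the reference does not say whether duplicates are removed). The name `merge_classdicts` is hypothetical.

```python
from collections import defaultdict


def merge_classdicts(dicts: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Concatenate the text lists of every label across all dictionaries."""
    merged = defaultdict(list)
    for classdict in dicts:
        for label, texts in classdict.items():
            merged[label].extend(texts)
    return dict(merged)


merged = merge_classdicts([{"a": ["x"], "b": ["y"]}, {"a": ["z"]}])
print(merged)
```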

shorttext.data.data_retrieval.yield_crossvalidation_classdicts(classdict: dict[str, list[str]], nb_partitions: int, shuffle: bool = False) → Generator[tuple[dict[str, list[str]], dict[str, list[str]]], None, None][source]

Yield training and test data partitions for cross-validation.

Partitions the training data into multiple sets. Each iteration yields a (test_dict, train_dict) pair where one partition is used as test data and the remaining partitions are combined as training data.

Args:: classdict: Training data dictionary with class labels as keys and lists of texts as values. nb_partitions: Number of partitions to create. shuffle: Whether to shuffle data before partitioning. Default: False.
Yields:: Tuples of (test_dict, train_dict) for each partition.
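
The partitioning scheme can be sketched as follows. Note the yield order: test data first, training data second. This is an illustration under assumptions, not the library's code: the round-robin split per label, the `seed` parameter, and the name `crossvalidation_classdicts` are all hypothetical.

```python
import random


def crossvalidation_classdicts(classdict, nb_partitions, shuffle=False, seed=None):
    """Yield (test_dict, train_dict) pairs, one per partition."""
    rng = random.Random(seed)
    partitions = [{} for _ in range(nb_partitions)]
    for label, texts in classdict.items():
        texts = list(texts)
        if shuffle:
            rng.shuffle(texts)
        for i in range(nb_partitions):
            chunk = texts[i::nb_partitions]  # round-robin split (an assumption)
            if chunk:
                partitions[i][label] = chunk
    for i, test_dict in enumerate(partitions):
        train_dict = {}
        for j, part in enumerate(partitions):
            if j == i:
                continue  # the held-out partition is excluded from training
            for label, texts in part.items():
                train_dict.setdefault(label, []).extend(texts)
        yield test_dict, train_dict


data = {"math": ["a", "b", "c", "d"], "physics": ["e", "f"]}
folds = list(crossvalidation_classdicts(data, nb_partitions=2))
for test_dict, train_dict in folds:
    print(test_dict, train_dict)
```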

Utilities

shorttext.utils.kerasmodel_io.save_model(nameprefix: str, model: tensorflow.keras.models.Model) → None[source]

Save a Keras model to files.

Args:: nameprefix: Prefix for output files. model: Keras model to save.

shorttext.utils.kerasmodel_io.load_model(nameprefix: str) → tensorflow.keras.models.Model[source]

Load a Keras model from files.

Args:: nameprefix: Prefix for input files.
Returns:: Loaded Keras model.

This module contains general routines to zip all model files into one compact file. The model can be copied or transferred easily.

The functions and decorators in this module are called by other code in the library. Developers are not advised to call them directly.

shorttext.utils.compactmodel_io.removedir(dir: str) → None[source]

Remove all subdirectories and files under the specified path.

Args:: dir: Path of the directory to clean.

shorttext.utils.compactmodel_io.save_compact_model(filename: str, savefunc: callable, prefix: str, suffices: str, infodict: dict[str, Any]) → None[source]

Save the model in one compact file by zipping all related files.

Args:: filename: Name of the output model file. savefunc: Function that performs the saving action. Takes one argument (str) - the prefix. prefix: Prefix of the names of the files related to the model. suffices: List of file suffixes. infodict: Dictionary with model information. Must contain the key ‘classifier’.

shorttext.utils.compactmodel_io.load_compact_model(filename: str, loadfunc: callable, prefix: str, infodict: dict[str, Any]) → Any[source]

Load a model from a compact file.

Args:: filename: Name of the model file. loadfunc: Function that performs the loading action. Takes one argument (str) - the prefix. prefix: Prefix of the names of the files. infodict: Dictionary with model information. Must contain the key ‘classifier’.
Returns:: The loaded model instance.
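
The idea of bundling every model file into one archive can be sketched with `zipfile` and `tempfile`. This is only an illustration of the pattern: the helper names `save_compact` and `load_compact`, the `config.json` member name, and the demo save and load functions are hypothetical and may differ from what the library actually writes.

```python
import json
import os
import tempfile
import zipfile


def save_compact(filename, savefunc, prefix, suffices, infodict):
    """Let savefunc write prefix+suffix files, then zip them with the metadata."""
    with tempfile.TemporaryDirectory() as tmpdir:
        savefunc(os.path.join(tmpdir, prefix))
        config_path = os.path.join(tmpdir, "config.json")  # hypothetical name
        with open(config_path, "w") as handle:
            json.dump(infodict, handle)
        with zipfile.ZipFile(filename, "w") as archive:
            for suffix in suffices:
                archive.write(os.path.join(tmpdir, prefix + suffix),
                              arcname=prefix + suffix)
            archive.write(config_path, arcname="config.json")


def load_compact(filename, loadfunc, prefix):
    """Unzip into a temporary directory and let loadfunc read from the prefix."""
    with tempfile.TemporaryDirectory() as tmpdir:
        with zipfile.ZipFile(filename) as archive:
            archive.extractall(tmpdir)
        return loadfunc(os.path.join(tmpdir, prefix))


def demo_save(nameprefix):
    with open(nameprefix + ".txt", "w") as handle:
        handle.write("weights: 1 2 3")


def demo_load(nameprefix):
    with open(nameprefix + ".txt") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "model.bin")
    save_compact(path, demo_save, "demo", [".txt"], {"classifier": "demo"})
    restored = load_compact(path, demo_load, "demo")
    with zipfile.ZipFile(path) as archive:
        names = sorted(archive.namelist())

print(restored, names)
```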

class shorttext.utils.compactmodel_io.CompactIOMachine(infodict: dict[str, Any], prefix: str, suffices: list[str])[source]

Bases: ABC

Base class that implements compact model I/O.

Replaces the original compactio decorator.

__init__(infodict: dict[str, Any], prefix: str, suffices: list[str])[source]

Initialize the compact I/O machine.

Args:: infodict: Dictionary with model information. Must contain ‘classifier’. prefix: Prefix for model file names. suffices: List of file suffixes for the model files.

abstractmethod savemodel(nameprefix: str) → None[source]

Save the model to files.

Args:: nameprefix: Prefix for model file paths.

abstractmethod loadmodel(nameprefix: str) → Self[source]

Load the model from files.

Args:: nameprefix: Prefix for model file paths.

save_compact_model(filename: str, *args, **kwargs) → None[source]

Save the model in a compressed binary format.

Args:: filename: Name of the model file. *args: Additional arguments. **kwargs: Additional keyword arguments.

load_compact_model(filename: str, *args, **kwargs) → Self[source]

Load the model from a compressed binary format.

Args:: filename: Name of the model file. *args: Additional arguments. **kwargs: Additional keyword arguments.

get_info() → dict[str, Any][source]

Get model metadata.

Returns:: Dictionary with classifier, prefix, and suffices.

shorttext.utils.compactmodel_io.get_model_config_field(filename: str | PathLike, parameter: str) → str[source]

Get a configuration parameter from a compact model file.

Args:: filename: Path to the model file. parameter: Parameter name to retrieve.
Returns:: The parameter value.

shorttext.utils.compactmodel_io.get_model_classifier_name(filename: str | PathLike) → str[source]

Get the classifier name from a compact model file.

Args:: filename: Path to the model file.
Returns:: The classifier name.

shorttext.utils.gensim_corpora.generate_gensim_corpora(classdict: dict[str, list[str]], preprocess_and_tokenize: callable | None = None) → tuple[gensim.corpora.Dictionary, list[list[tuple[int, int]]], list[str]][source]

Generate gensim dictionary and corpus from training data.

Args:: classdict: Training data with class labels as keys and lists of texts as values. preprocess_and_tokenize: Function to preprocess and tokenize text. Default: tokenize.
Returns:: Tuple of (dictionary, corpus, class_labels).

shorttext.utils.gensim_corpora.save_corpus(dictionary: gensim.corpora.Dictionary, corpus: list[list[tuple[int, int]]], prefix: str) → None[source]

Save gensim corpus and dictionary to files.

Args:: dictionary: Dictionary to save. corpus: Corpus to save. prefix: Prefix for output files.
Note:: Deprecated since 5.0.0, will be removed in 6.0.0.

Deprecated since version 4.0.0: This will be removed in 5.0.0.

shorttext.utils.gensim_corpora.load_corpus(prefix: str) → tuple[gensim.corpora.MmCorpus, gensim.corpora.Dictionary][source]

Load gensim corpus and dictionary from files.

Args:: prefix: Prefix of files to load.
Returns:: Tuple of (corpus, dictionary).
Note:: Deprecated since 5.0.0, will be removed in 6.0.0.

Deprecated since version 4.0.0: This will be removed in 5.0.0.

shorttext.utils.gensim_corpora.update_corpus_labels(dictionary: gensim.corpora.Dictionary, corpus: list[list[tuple[int, int]]], newclassdict: dict[str, list[str]], preprocess_and_tokenize: callable | None = None) → tuple[list[list[tuple[int, int]]], list[list[tuple[int, int]]]][source]

Update corpus with additional training data.

Args:: dictionary: Existing dictionary. corpus: Existing corpus. newclassdict: Additional training data. preprocess_and_tokenize: Function to preprocess text. Default: tokenize.
Returns:: Tuple of (updated_corpus, new_corpus).

shorttext.utils.gensim_corpora.tokens_to_fracdict(tokens: list[str]) → dict[str, float][source]

Convert tokens to normalized frequency dictionary.

Args:: tokens: List of tokens.
Returns:: Dictionary with tokens as keys and normalized frequencies as values.
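
A minimal sketch of a normalized frequency dictionary, assuming "normalized" means each count divided by the total number of tokens so that the values sum to 1. The name `fracdict` is hypothetical.

```python
from collections import Counter


def fracdict(tokens: list[str]) -> dict[str, float]:
    """Map each token to its share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in counts.items()}


result = fracdict(["a", "b", "a", "c"])
print(result)
```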

shorttext.utils.textpreprocessing.tokenize(s: str) → list[str][source]

Tokenize a string by splitting on whitespace.

Args:: s: Input string to tokenize.
Returns:: List of tokens split by whitespace.

class shorttext.utils.textpreprocessing.StemmerSingleton[source]

Bases: object

Singleton class for Porter stemmer.

Provides a singleton instance of the snowball stemmer for English.

__call__(s: str) → str[source]

Stem a word using Porter stemmer.

Args:: s: Word to stem.
Returns:: Stemmed word.

shorttext.utils.textpreprocessing.stemword(s: str) → str[source]

Stem a word using Porter stemmer.

Args:: s: Word to stem.
Returns:: Stemmed word.

shorttext.utils.textpreprocessing.preprocess_text(text: str, pipeline: list[callable]) → str[source]

Preprocess text according to a given pipeline.

Applies a sequence of preprocessing functions to the input text. Each function in the pipeline transforms the text (e.g., stemming, lemmatizing, removing punctuation).

Args:: text: Input text to preprocess. pipeline: List of functions that each transform a text string to another text string.
Returns:: The preprocessed text after applying all pipeline functions.
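
Applying a pipeline amounts to feeding the output of each function into the next. The sketch below shows that pattern with `functools.reduce`; the name `apply_pipeline` and the example steps are hypothetical and are not the library's own pipeline.

```python
import re
from functools import reduce


def apply_pipeline(text: str, pipeline: list) -> str:
    """Apply each text-to-text function in order."""
    return reduce(lambda current, step: step(current), pipeline, text)


pipeline = [
    lambda s: re.sub(r"[^\w\s]", "", s),  # drop punctuation
    lambda s: re.sub(r"\d", "", s),       # drop digits
    str.lower,
    str.strip,
]
cleaned = apply_pipeline("  Hello, World 42! ", pipeline)
print(cleaned)
```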

shorttext.utils.textpreprocessing.tokenize_text(text: str, presplit_pipeline: list[callable], primitize_tokenizer: callable, postsplit_pipeline: list[callable], stopwordsfile: TextIO) → list[str][source]

Tokenize text with preprocessing pipelines.

Applies pre-split and post-split pipelines to tokenize text, filtering out stopwords.

Args:: text: Input text to tokenize. presplit_pipeline: List of functions to apply before tokenization. primitize_tokenizer: Tokenizer function to split text into tokens. postsplit_pipeline: List of functions to apply to each token after tokenization. stopwordsfile: File containing stopwords to filter out.
Returns:: List of tokens after preprocessing and stopword filtering.
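
The flow described above (pre-split pipeline, tokenizer, post-split pipeline, stopword filter) can be sketched as follows. Filtering stopwords after the post-split pipeline, and reading one stopword per line from the file, are assumptions of this sketch; the name `tokenize_with_pipelines` is hypothetical.

```python
import io


def tokenize_with_pipelines(text, presplit_pipeline, tokenizer,
                            postsplit_pipeline, stopwordsfile):
    """Tokenize text, transforming it before and after the split."""
    # One stopword per line (an assumption about the file format).
    stopwords = {line.strip() for line in stopwordsfile if line.strip()}
    for step in presplit_pipeline:
        text = step(text)  # whole-text transformations
    tokens = tokenizer(text)
    for step in postsplit_pipeline:
        tokens = [step(token) for token in tokens]  # per-token transformations
    # Stopwords are filtered last (an assumption of this sketch).
    return [token for token in tokens if token and token not in stopwords]


stopwordsfile = io.StringIO("the\nof\n")
tokens = tokenize_with_pipelines(
    "The Laws of Physics",
    [str.lower],
    str.split,
    [lambda t: t.rstrip("s")],  # crude stand-in for a stemmer
    stopwordsfile,
)
print(tokens)
```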

shorttext.utils.textpreprocessing.text_preprocessor(pipeline: list[callable]) → callable[source]

Create a text preprocessor function from a pipeline.

Returns a function that applies the given pipeline to preprocess text. This is a convenience function that wraps preprocess_text with a fixed pipeline.

Args:: pipeline: List of functions that transform text to text.
Returns:: A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.oldschool_standard_text_preprocessor(stopwordsfile: TextIO) → callable[source]

Create a standard text preprocessor.

Returns a text preprocessor with the following steps:
- Remove special characters
- Remove numerals
- Convert to lowercase
- Remove stop words
- Stem words using Porter stemmer

Args:: stopwordsfile: File object containing stopwords to filter.
Returns:: A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.standard_text_preprocessor_1() → callable[source]

Create a standard text preprocessor using NLTK stopwords.

Returns a text preprocessor with the following steps:
- Remove special characters
- Remove numerals
- Convert to lowercase
- Remove stop words (NLTK list)
- Stem words using Porter stemmer

Returns:: A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.standard_text_preprocessor_2() → callable[source]

Create a standard text preprocessor with negation-aware stopwords.

Returns a text preprocessor with the following steps:
- Remove special characters
- Remove numerals
- Convert to lowercase
- Remove stop words (NLTK list minus negation terms)
- Stem words using Porter stemmer

Returns:: A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.advanced_text_tokenizer_1() → callable[source]

Create an advanced text tokenizer.

Returns a tokenizer function that applies the following preprocessing steps:
- Remove special characters
- Remove numerals
- Convert to lowercase
- Stem tokens using Porter stemmer
- Filter out negation-aware stopwords

Returns:: A callable that takes text and returns a list of tokens.

shorttext.utils.wordembed.load_word2vec_model(path: str | PathLike, binary: bool = True) → gensim.models.keyedvectors.KeyedVectors[source]

Load a pre-trained Word2Vec model.

Args:: path: Path to the Word2Vec model file. binary: Whether the file is in binary format. Default: True.
Returns:: A KeyedVectors model containing word embeddings.

shorttext.utils.wordembed.load_fasttext_model(path: str | PathLike, encoding: Any = 'utf-8') → gensim.models.fasttext.FastTextKeyedVectors[source]

Load a pre-trained FastText model.

Args:: path: Path to the FastText model file. encoding: File encoding. Default: ‘utf-8’.
Returns:: A FastTextKeyedVectors model.

shorttext.utils.wordembed.load_poincare_model(path: str | PathLike, word2vec_format: bool = True, binary: bool = False) → gensim.models.poincare.PoincareKeyedVectors[source]

Load a Poincaré embedding model.

Args:: path: Path to the Poincaré model file. word2vec_format: Whether to load from word2vec format. Default: True. binary: Whether file is binary. Default: False.
Returns:: A PoincareKeyedVectors model.

shorttext.utils.wordembed.shorttext_to_avgvec(shorttext: str, wvmodel: gensim.models.keyedvectors.KeyedVectors) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array'][source]

Convert short text to averaged embedding vector.

Converts each token to its word embedding, averages them, and normalizes the result.

Args:: shorttext: Input text. wvmodel: Word embedding model.
Returns:: A normalized vector representation of the text.
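
The averaging and normalization step can be sketched without any embedding library. Here a plain dictionary stands in for the word embedding model; the name `average_embedding`, whitespace tokenization, skipping of out-of-vocabulary tokens, and the zero-vector fallback are assumptions of this sketch.

```python
import math


def average_embedding(shorttext: str, embeddings: dict, vecsize: int) -> list:
    """Average the vectors of known tokens, then scale to unit length."""
    total = [0.0] * vecsize
    count = 0
    for token in shorttext.split():
        vector = embeddings.get(token)
        if vector is None:
            continue  # skip out-of-vocabulary tokens (an assumption)
        total = [t + v for t, v in zip(total, vector)]
        count += 1
    if count == 0:
        return total  # no known tokens: return the zero vector
    mean = [t / count for t in total]
    norm = math.sqrt(sum(m * m for m in mean))
    return mean if norm == 0 else [m / norm for m in mean]


embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}  # toy two-dimensional model
vec = average_embedding("cat dog bird", embeddings, vecsize=2)
print(vec)
```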

class shorttext.utils.wordembed.RESTfulKeyedVectors(*args: Any, **kwargs: Any)[source]

Bases: KeyedVectors

Remote word vector client via REST API.

Connects to a remote WordEmbedAPI service to access word embeddings via HTTP requests.

Attributes:: url: Base URL of the API. port: Port number for the API.

__init__(url: str, port: str | int = '5000')[source]

Initialize the client.

Args:: url: Base URL of the API (e.g., ‘http://localhost’). port: Port number. Default: ‘5000’.

closer_than(entity1: str, entity2: str) → list | dict[source]

Find words closer to entity1 than entity2 is.

Args:: entity1: First word. entity2: Reference word.
Returns:: List of words closer to entity1 than entity2.

distance(entity1: str, entity2: str) → float[source]

Compute distance between two words.

Args:: entity1: First word. entity2: Second word.
Returns:: Distance between the word vectors.

distances(entity1: str, other_entities: list[str] | None = None) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array'][source]

Compute distances from one word to multiple words.

Args:: entity1: First word. other_entities: List of words to compare against.
Returns:: Array of distances.

get_vector(entity: str) → Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array'][source]

Get word vector for a word.

Args:: entity: Word to get vector for.
Returns:: Word embedding vector.
Raises:: KeyError: If word not in vocabulary.

most_similar(**kwargs) → list[tuple[str, float]][source]

Find most similar words.

Args:: **kwargs: Arguments passed to the API (e.g., positive, negative).
Returns:: List of (word, similarity) tuples.

most_similar_to_given(entity1: str, entities_list: list[str]) → list[str][source]

Find most similar word from a list to a given word.

Args:: entity1: Reference word. entities_list: List of candidate words.
Returns:: List of words sorted by similarity.

rank(entity1: str, entity2: str) → int[source]

Get similarity rank between two words.

Args:: entity1: First word. entity2: Second word.
Returns:: Rank of entity2 relative to entity1.

save(fname_or_handle: TextIO, **kwargs) → None[source]

Save is not supported for remote vectors.

Raises:: IOError: Always, since remote vectors cannot be saved locally.

similarity(entity1: str, entity2: str) → float[source]

Compute similarity between two words.

Args:: entity1: First word. entity2: Second word.
Returns:: Similarity score between 0 and 1.

shorttext.utils.compute.cosine_similarity(vec1: Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array'], vec2: Annotated[ndarray[tuple[Any, ...], dtype[float64]], '1D array']) → float

Compute cosine similarity between two vectors.

Args:: vec1: First vector. vec2: Second vector.
Returns:: Cosine similarity score between -1 and 1 (between 0 and 1 when neither vector has negative components).
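
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The pure-Python sketch below illustrates the formula; the name `cosine` is hypothetical and the library's own function operates on NumPy arrays.

```python
import math


def cosine(vec1: list, vec2: list) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


parallel = cosine([1.0, 2.0], [2.0, 4.0])    # same direction, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular, exactly 0
print(parallel, orthogonal)
```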

shorttext.utils.misc.textfile_generator(textfile: TextIOWrapper, linebreak: bool = True, encoding: bool | None = None) → Generator[str, None, None][source]

Generator that yields lines from a text file.

Args:: textfile: File object to read lines from. linebreak: Whether to include line break at end of each line. Default: True. encoding: Encoding of the text file. Default: None.
Yields:: Lines from the text file, stripped of whitespace.

class shorttext.utils.misc.SinglePoolExecutor[source]

Bases: object

Wrapper for the built-in Python map function.

Provides an interface similar to concurrent.futures.Executor.map but using a synchronous map implementation.

map(func, *iterables)[source]

Apply function to iterables element-wise.

Args:: func: Function to apply to each element. iterables: One or more iterables to process.
Returns:: An iterator yielding the results.
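
The pattern is easy to picture: an object with an executor-like `map` method that simply delegates to the built-in `map`, so code written against an executor interface can run synchronously. The class name `SynchronousExecutor` below is hypothetical and only mirrors the description above.

```python
class SynchronousExecutor:
    """Executor-like object whose map runs in the calling thread."""

    def map(self, func, *iterables):
        # Delegates to the built-in map; results are produced lazily.
        return map(func, *iterables)


executor = SynchronousExecutor()
squares = list(executor.map(lambda x: x * x, [1, 2, 3]))
sums = list(executor.map(lambda a, b: a + b, [1, 2], [10, 20]))
print(squares, sums)
```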

shorttext.utils.dtm.generate_npdict_document_term_matrix(corpus: list[str], doc_ids: list[Any], tokenize_func: callable) → NumpyNDArrayWrappedDict[source]

Generate document-term matrix as numpy dict.

Args:: corpus: List of documents. doc_ids: List of document IDs. tokenize_func: Tokenization function.
Returns:: NumpyNDArrayWrappedDict containing the document-term matrix.
Raises:: UnequalArrayLengthsException: If corpus and doc_ids have different lengths.

shorttext.utils.dtm.convert_classdict_to_corpus(classdict: dict[str, list[str]], preprocess_func: callable) → tuple[list[str], list[str]][source]

Convert class dictionary to corpus and document IDs.

Args:: classdict: Training data with class labels as keys and texts as values. preprocess_func: Text preprocessing function.
Returns:: Tuple of (corpus, doc_ids).

shorttext.utils.dtm.convert_classdict_to_xy(classdict: dict[str, list[str]], labels2idx: dict[str, int], preprocess_func: callable, tokenize_func: callable) → tuple[NumpyNDArrayWrappedDict, Annotated[SparseArray, '2D Array']][source]

Convert class dictionary to feature matrix and labels.

Args:: classdict: Training data. labels2idx: Mapping from labels to indices. preprocess_func: Text preprocessing function. tokenize_func: Tokenization function.
Returns:: Tuple of (document-term matrix, label matrix).

shorttext.utils.dtm.compute_document_frequency(npdtm: NumpyNDArrayWrappedDict) → ndarray[tuple[Any, ...], dtype[int32]][source]

Compute document frequency for each token.

Args:: npdtm: Document-term matrix.
Returns:: Array of document frequencies for each token.
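
Document frequency counts, for each token, the number of documents in which it appears at least once. The sketch below uses a dictionary of dictionaries as a stand-in for the document-term matrix; that structure and the name `document_frequency` are assumptions, not the library's types.

```python
def document_frequency(dtm: dict[str, dict[str, float]]) -> dict[str, int]:
    """Count the documents in which each token has a positive frequency."""
    freq: dict[str, int] = {}
    for doc_tokens in dtm.values():
        for token, count in doc_tokens.items():
            if count > 0:
                freq[token] = freq.get(token, 0) + 1
    return freq


dtm = {
    "d1": {"cat": 2, "dog": 1},
    "d2": {"cat": 1},
    "d3": {"bird": 3},
}
df = document_frequency(dtm)
print(df)
```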

shorttext.utils.dtm.compute_tfidf_document_term_matrix(npdtm: NumpyNDArrayWrappedDict, sparse: bool = True) → NumpyNDArrayWrappedDict[source]

Compute TF-IDF weighted document-term matrix.

Args:: npdtm: Document-term matrix. sparse: Whether to return sparse format. Default: True.
Returns:: TF-IDF weighted document-term matrix.

class shorttext.utils.dtm.NumpyDocumentTermMatrix(corpus: list[str] | None = None, docids: list[Any] | None = None, tfidf: bool = False, tokenize_func: callable | None = None)[source]

Bases: CompactIOMachine

Document-term matrix using numpy dict.

Provides an interface for working with document-term matrices with compact model I/O support.

__init__(corpus: list[str] | None = None, docids: list[Any] | None = None, tfidf: bool = False, tokenize_func: callable | None = None)[source]

Initialize the document-term matrix.

Args:: corpus: List of documents. docids: List of document IDs. tfidf: Whether to apply TF-IDF weighting. Default: False. tokenize_func: Tokenization function. Default: advanced_text_tokenizer_1.

generate_dtm(corpus: list[str], docids: list[Any] | None = None, tfidf: bool = False) → None[source]

Generate document-term matrix from corpus.

Args:: corpus: List of documents. docids: List of document IDs. tfidf: Whether to apply TF-IDF weighting. Default: False.

get_termfreq(docid: str, token: str) → float[source]

Get term frequency for a document and token.

Args:: docid: Document ID. token: Token.
Returns:: Term frequency.

get_total_termfreq(token: str) → float[source]

Get total frequency of a token across all documents.

Args:: token: Token.
Returns:: Total term frequency.

get_doc_frequency(token) → int[source]

Get document frequency of a token.

Args:: token: Token.
Returns:: Number of documents containing the token.

get_token_occurences(token: str) → dict[str, float][source]

Get token occurrences across all documents.

Args:: token: Token.
Returns:: Dictionary mapping document IDs to term frequencies.

get_doc_tokens(docid: str) → dict[str, float][source]

Get tokens for a specific document.

Args:: docid: Document ID.
Returns:: Dictionary mapping tokens to frequencies.

savemodel(nameprefix: str) → None[source]

Save the document-term matrix.

Args:: nameprefix: Prefix for output file.

loadmodel(nameprefix: str) → Self[source]

Load the document-term matrix.

Args:: nameprefix: Prefix for input file.

property docids: list[str]: List of document IDs.

property tokens: list[str]: List of tokens.

property nbdocs: int: Number of documents.

property nbtokens: int: Number of unique tokens.

classmethod from_npdict_file(filepath: str | PathLike) → Self[source]

Load a document-term matrix from a compact file.

Args:: filepath: Path to the compact model file.
Returns:: NumpyDocumentTermMatrix instance.

shorttext.utils.dtm.load_numpy_documentmatrixmatrix(filepath: str | PathLike) → NumpyDocumentTermMatrix[source]: Deprecated. Use NumpyDocumentTermMatrix.from_npdict_file.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

exception shorttext.utils.classification_exceptions.ModelNotTrainedException[source]

Bases: Exception

Exception raised when attempting to use an untrained model.

exception shorttext.utils.classification_exceptions.AlgorithmNotExistException(algoname: str)[source]

Bases: Exception

Exception raised when a requested algorithm is not available.

exception shorttext.utils.classification_exceptions.WordEmbeddingModelNotExistException(path: str | PathLike)[source]

Bases: Exception

Exception raised when the word embedding model file is not found.

exception shorttext.utils.classification_exceptions.UnequalArrayLengthsException(arr1: ndarray | list, arr2: ndarray | list)[source]

Bases: Exception

Exception raised when two arrays have unequal lengths.

shorttext.utils.classification_exceptions.NotImplementedException()[source]: Exception raised when a method is not implemented.

Deprecated since version 4.0.0: This will be removed in 5.0.0.

exception shorttext.utils.classification_exceptions.IncorrectClassificationModelFileException(expectedname: str, actualname: str)[source]

Bases: Exception

Exception raised when model file doesn’t match expected type.

exception shorttext.utils.classification_exceptions.OperationNotDefinedException(opname: str)[source]

Bases: Exception

Exception raised when an operation is not defined.

Schemas

class shorttext.schemas.models.AutoEncoderPackage(autoencoder: tensorflow.keras.Model, encoder: tensorflow.keras.Model, decoder: tensorflow.keras.Model)[source]

Bases: object

Package containing autoencoder components.

Attributes:: autoencoder: The full autoencoder model. encoder: The encoder part of the autoencoder. decoder: The decoder part of the autoencoder.

autoencoder: tensorflow.keras.Model

encoder: tensorflow.keras.Model

decoder: tensorflow.keras.Model

CLI

shorttext.cli.categorization.get_argparser() → ArgumentParser[source]

Get argument parser for short text categorization CLI.

Returns:: ArgumentParser for command line arguments.

shorttext.cli.categorization.main()[source]

shorttext.cli.wordembedsim.getargparser() → ArgumentParser[source]

Get argument parser for word embedding similarity CLI.

Returns:: ArgumentParser for command line arguments.

shorttext.cli.wordembedsim.main() → None[source]
