API

API functions and classes not covered in the tutorials are documented here.

Shorttext Models Smart Loading

shorttext.smartload.smartload_compact_model(filename, wvmodel, preprocessor=<function text_preprocessor.<locals>.<lambda>>, vecsize=None)

Load the appropriate classifier or model from a compact binary model file.

The second parameter, wvmodel, can be set to None if no Word2Vec model is needed.

Parameters:
  • filename (str) – path of the compact model file
  • wvmodel (gensim.models.keyedvectors.KeyedVectors) – Word2Vec model
  • preprocessor (function) – text preprocessor (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
  • vecsize (int) – length of embedded vectors in the model (Default: None, extracted directly from the word-embedding model)
Returns:

appropriate classifier or model

Raise:

AlgorithmNotExistException
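
For example, a compact model file saved earlier can be restored in one call. A minimal sketch, assuming a compact model file sample_model.bin and a local copy of the Google News word-embedding file (both paths are placeholders):

>>> import shorttext
>>> from gensim.models import KeyedVectors
>>> wvmodel = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
>>> classifier = shorttext.smartload.smartload_compact_model('sample_model.bin', wvmodel)
>>> # if the model does not need a word embedding, wvmodel can be None
>>> topicmodel = shorttext.smartload.smartload_compact_model('sample_topicmodel.bin', None)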

Supervised Classification using Word Embedding

Module shorttext.generators.seq2seq.s2skeras

class shorttext.generators.seq2seq.s2skeras.Seq2SeqWithKeras(vecsize, latent_dim)

Class implementing sequence-to-sequence (seq2seq) learning with keras.

Reference:

Ilya Sutskever, James Martens, Geoffrey Hinton, “Generating Text with Recurrent Neural Networks,” ICML (2011). [UToronto]

Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). [arXiv]

François Chollet, “A ten-minute introduction to sequence-to-sequence learning in Keras,” The Keras Blog. [Keras]

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O’Reilly]

compile(optimizer='rmsprop', loss='categorical_crossentropy')

Compile the keras model after it has been prepared by running prepare_model().

Parameters:
  • optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: rmsprop)
  • loss (str) – loss function available from keras (Default: 'categorical_crossentropy')
Returns:

None

fit(encoder_input, decoder_input, decoder_output, batch_size=64, epochs=100)

Fit the sequence-to-sequence (seq2seq) model with the given training sequences.

Parameters:
  • encoder_input (numpy.array) – encoder input, a rank-3 tensor
  • decoder_input (numpy.array) – decoder input, a rank-3 tensor
  • decoder_output (numpy.array) – decoder output, a rank-3 tensor
  • batch_size (int) – batch size (Default: 64)
  • epochs (int) – number of epochs (Default: 100)
Returns:

None
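
A typical workflow is to prepare the model, compile it, and then fit it. A minimal sketch, with random rank-3 tensors standing in for real encoded sequences:

>>> import numpy as np
>>> from shorttext.generators.seq2seq.s2skeras import Seq2SeqWithKeras
>>> generator = Seq2SeqWithKeras(vecsize=100, latent_dim=128)
>>> generator.prepare_model()
>>> generator.compile()
>>> # rank-3 tensors of shape (nb_samples, sequence_length, vecsize)
>>> enc_in = np.random.rand(8, 20, 100)
>>> dec_in = np.random.rand(8, 20, 100)
>>> dec_out = np.random.rand(8, 20, 100)
>>> generator.fit(enc_in, dec_in, dec_out, batch_size=4, epochs=5)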

loadmodel(prefix)

Load a trained model from various files.

To load a compact model, call load_compact_model().

Parameters:prefix (str) – prefix of the file path
Returns:None
prepare_model()

Prepare the keras model.

Returns:None
savemodel(prefix, final=False)

Save the trained models into multiple files.

To save it compactly, call save_compact_model().

If final is set to True, the model cannot be further trained.

If there is no trained model, a ModelNotTrainedException will be thrown.

Parameters:
  • prefix (str) – prefix of the file path
  • final (bool) – whether the model is final (that should not be trained further) (Default: False)
Returns:

None

Raise:

ModelNotTrainedException

shorttext.generators.seq2seq.s2skeras.loadSeq2SeqWithKeras(path, compact=True)

Load a trained Seq2SeqWithKeras class from file.

Parameters:
  • path (str) – path of the model file
  • compact (bool) – whether it is a compact model (Default: True)
Returns:

a Seq2SeqWithKeras class for sequence to sequence inference

Return type:

Seq2SeqWithKeras
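
Continuing the sketch above, a trained instance can be saved compactly and restored later (the path ./s2smodel.bin is a placeholder):

>>> from shorttext.generators.seq2seq.s2skeras import loadSeq2SeqWithKeras
>>> generator.save_compact_model('./s2smodel.bin')
>>> generator2 = loadSeq2SeqWithKeras('./s2smodel.bin', compact=True)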

Module shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification

class shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification.VarNNSumEmbeddedVecClassifier(wvmodel, vecsize=None, maxlen=15)

This is a wrapper for various neural network algorithms for supervised short text categorization. Each class label has a few short sentences, where each token is converted to an embedded vector given by a pre-trained word-embedding model (e.g., the Google Word2Vec model). The sentences are represented by an array. The type of neural network has to be passed when training, and it has to be of type keras.models.Sequential. The number of outputs of the model has to match the number of class labels in the training data. To perform prediction, the input short sentence is converted to a unit vector in the same way. The score is calculated according to the trained neural network model.

Examples of the models can be found in frameworks.

A pre-trained Google Word2Vec model can be downloaded here.
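
A minimal training sketch; the word-embedding path and the toy training data below are placeholders:

>>> import shorttext
>>> from gensim.models import KeyedVectors
>>> from shorttext.classifiers.embed.sumvec import frameworks
>>> wvmodel = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
>>> # toy training data: class labels mapped to lists of short texts
>>> classdict = {'greeting': ['hello there', 'good morning'],
...              'farewell': ['goodbye', 'see you later']}
>>> classifier = shorttext.classifiers.VarNNSumEmbeddedVecClassifier(wvmodel)
>>> kmodel = frameworks.DenseWordEmbed(len(classdict), dense_nb_nodes=[16], dense_actfcn=['relu'])
>>> classifier.train(classdict, kmodel)
>>> classifier.score('hi there')   # a dict of scores keyed by class label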

convert_traindata_embedvecs(classdict)

Convert the training text data into embedded matrix.

Convert the training text data into an embedded matrix, where each short sentence is represented by the normalized sum of the embedded vectors of all its words.

Parameters:classdict (dict) – training data
Returns:tuples, consisting of class labels, matrix of embedded vectors, and corresponding outputs
Return type:(list, numpy.ndarray, list)
loadmodel(nameprefix)

Load a trained model from files.

Given the prefix of the file paths, load the model from files with name given by the prefix followed by “_classlabels.txt”, “.json”, and “.h5”.

If neither this method nor train() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.

Parameters:nameprefix (str) – prefix of the file path
Returns:None
savemodel(nameprefix)

Save the trained model into files.

Given the prefix of the file paths, save the model into three files, with names given by the prefix followed by “_classlabels.txt”, “.json”, and “.h5” respectively. If there is no trained model, a ModelNotTrainedException will be thrown.

Parameters:nameprefix (str) – prefix of the file path
Returns:None
Raise:ModelNotTrainedException
score(shorttext)

Calculate the scores for all the class labels for the given short sentence.

Given a short sentence, calculate the classification scores for all class labels, returned as a dictionary with keys being the class labels and values being the scores. If the short sentence is empty, or if other numerical errors occur, the score will be numpy.nan.

If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters:shorttext (str) – a short sentence
Returns:a dictionary with keys being the class labels, and values being the corresponding classification scores
Return type:dict
Raise:ModelNotTrainedException
shorttext_to_embedvec(shorttext)

Convert the short text into an averaged embedded vector representation.

Given a short sentence, it converts all the tokens into embedded vectors according to the given word-embedding model, sums them up, and normalizes the resulting vector. The returned vector represents the short sentence.

Parameters:shorttext (str) – a short sentence
Returns:an embedded vector that represents the short sentence
Return type:numpy.ndarray
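
For example, with the classifier sketched above and a 300-dimensional embedding:

>>> vec = classifier.shorttext_to_embedvec('good morning')
>>> vec.shape
(300,)
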
train(classdict, kerasmodel, nb_epoch=10)

Train the classifier.

The training data and the corresponding keras model have to be given.

If neither this method nor loadmodel() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.

Parameters:
  • classdict (dict) – training data
  • kerasmodel (keras.models.Sequential) – keras sequential model
  • nb_epoch (int) – number of steps / epochs in training
Returns:

None

word_to_embedvec(word)

Convert the given word into an embedded vector.

Given a word, return the corresponding embedded vector according to the word-embedding model. If the word is not in the model, a vector of zeros is returned.

Parameters:word (str) – a word
Returns:the corresponding embedded vector
Return type:numpy.ndarray
shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification.load_varnnsumvec_classifier(wvmodel, name, compact=True, vecsize=None)

Load a shorttext.classifiers.VarNNSumEmbeddedVecClassifier instance from file, given the pre-trained word-embedding model.

Parameters:
  • wvmodel (gensim.models.keyedvectors.KeyedVectors) – Word2Vec model
  • name (str) – name (if compact=True) or prefix (if compact=False) of the file path
  • compact (bool) – whether the model file is compact (Default: True)
  • vecsize (int) – length of embedded vectors in the model (Default: None, extracted directly from the word-embedding model)
Returns:

the classifier

Return type:

VarNNSumEmbeddedVecClassifier
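
Continuing the sketch above, a classifier saved with savemodel() can be restored by this function (the prefix ./sumvec_nn is a placeholder):

>>> from shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification import load_varnnsumvec_classifier
>>> classifier.savemodel('./sumvec_nn')
>>> classifier2 = load_varnnsumvec_classifier(wvmodel, './sumvec_nn', compact=False)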

Neural Networks

Module shorttext.classifiers.embed.sumvec.frameworks

shorttext.classifiers.embed.sumvec.frameworks.DenseWordEmbed(nb_labels, dense_nb_nodes=[], dense_actfcn=[], vecsize=300, reg_coef=0.1, final_activiation='softmax', optimizer='adam')

Return layers of dense neural network.

Return layers of dense neural network. This assumes the input to be a rank-1 vector.

Parameters:
  • nb_labels (int) – number of class labels
  • dense_nb_nodes (list) – number of nodes in each layer (Default: [])
  • dense_actfcn (list) – activation functions for each layer (Default: [])
  • vecsize (int) – length of the embedded vectors in the model (Default: 300)
  • reg_coef (float) – regularization coefficient (Default: 0.1)
  • final_activiation (str) – activation function of the final layer (Default: softmax)
  • optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns:

keras sequential model for dense neural network

Return type:

keras.models.Model
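
For instance, a two-hidden-layer network for a three-class problem might be built as follows (the layer sizes are illustrative):

>>> from shorttext.classifiers.embed.sumvec.frameworks import DenseWordEmbed
>>> kmodel = DenseWordEmbed(3, dense_nb_nodes=[32, 16], dense_actfcn=['relu', 'relu'], vecsize=300)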

Utilities

Module shorttext.utils.kerasmodel_io

shorttext.utils.kerasmodel_io.load_model(nameprefix)

Load a keras sequential model from files.

Given the prefix of the file paths, load a keras sequential model from a JSON file and an HDF5 file.

Parameters:nameprefix (str) – Prefix of the paths of the model files
Returns:keras sequential model
Return type:keras.models.Model
shorttext.utils.kerasmodel_io.save_model(nameprefix, model)

Save a keras sequential model into files.

Given a keras sequential model, save the model with the given file path prefix. It saves the model into a JSON file, and an HDF5 file (.h5).

Parameters:
  • nameprefix (str) – Prefix of the paths of the model files
  • model (keras.models.Model) – keras sequential model to be saved
Returns:

None
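
A save/load round trip, reusing the kmodel built in the sketch above (./mymodel is a placeholder prefix):

>>> from shorttext.utils import kerasmodel_io
>>> kerasmodel_io.save_model('./mymodel', kmodel)   # writes ./mymodel.json and ./mymodel.h5
>>> kmodel2 = kerasmodel_io.load_model('./mymodel')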

Module shorttext.utils.gensim_corpora

shorttext.utils.gensim_corpora.generate_gensim_corpora(classdict, preprocess_and_tokenize=<function <lambda>>)

Generate gensim bag-of-words dictionary and corpus.

Given text data, a dict with keys being the class labels and values being the lists of short texts, in the same format as the output of shorttext.data.data_retrieval, return a gensim dictionary and corpus.

Parameters:
  • classdict (dict) – text data, a dict with keys being the class labels, and each value is a list of short texts
  • preprocess_and_tokenize (function) – preprocessor function that takes a short sentence and returns a list of tokens (Default: shorttext.utils.tokenize)
Returns:

a tuple, consisting of a gensim dictionary, a corpus, and a list of class labels

Return type:

(gensim.corpora.Dictionary, list, list)
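
A sketch with toy training data:

>>> from shorttext.utils import gensim_corpora
>>> classdict = {'fruit': ['apple pie', 'banana split'], 'drink': ['apple juice', 'iced tea']}
>>> gdict, corpus, labels = gensim_corpora.generate_gensim_corpora(classdict)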

shorttext.utils.gensim_corpora.load_corpus(prefix)

Load gensim corpus and dictionary.

Parameters:prefix (str) – prefix of the file to load
Returns:corpus and dictionary
Return type:tuple
shorttext.utils.gensim_corpora.save_corpus(dictionary, corpus, prefix)

Save gensim corpus and dictionary.

Parameters:
  • dictionary (gensim.corpora.Dictionary) – dictionary to save
  • corpus (list) – corpus to save
  • prefix (str) – prefix of the files to save
Returns:

None
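
Continuing the sketch above, the corpus and dictionary can be saved and reloaded (./mycorpus is a placeholder prefix):

>>> gensim_corpora.save_corpus(gdict, corpus, './mycorpus')
>>> corpus2, gdict2 = gensim_corpora.load_corpus('./mycorpus')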

shorttext.utils.gensim_corpora.tokens_to_fracdict(tokens)

Return the normalized bag-of-words (BOW) counts of the given tokens.

Parameters:tokens (list) – list of tokens.
Returns:normalized counts of the tokens as a dict
Return type:dict
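
For example (values shown in a comment because dict ordering may vary):

>>> from shorttext.utils.gensim_corpora import tokens_to_fracdict
>>> fracdict = tokens_to_fracdict(['to', 'be', 'or', 'not', 'to', 'be'])
>>> # counts normalized to sum to 1, e.g. fracdict['to'] == 1./3
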
shorttext.utils.gensim_corpora.update_corpus_labels(dictionary, corpus, newclassdict, preprocess_and_tokenize=<function <lambda>>)

Update corpus with additional training data.

With the additional training data, the dictionary and corpus are updated.

Parameters:
  • dictionary (gensim.corpora.Dictionary) – original dictionary
  • corpus (list) – original corpus
  • newclassdict (dict) – additional training data
  • preprocess_and_tokenize (function) – preprocessor function that takes a short sentence and returns a list of tokens (Default: shorttext.utils.tokenize)
Returns:

a tuple, consisting of the updated corpus and the new corpus (for updating the model)

Return type:

tuple

Module shorttext.utils.compactmodel_io

This module contains general routines to zip all model files into one compact file, so that the model can be copied or transferred conveniently.

The methods and decorators in this module are called by other parts of the package. It is not recommended for developers to call them directly.

class shorttext.utils.compactmodel_io.CompactIOMachine(infodict, prefix, suffices)

Base class that implements compact model I/O.

This is to replace the original compactio() decorator.

get_info()

Get the information dictionary for the dressed machine.

Returns:dictionary of the information for the dressed machine.
Return type:dict
load_compact_model(filename, *args, **kwargs)

Load the model in a compressed binary format.

Parameters:
  • filename (str) – name of the model file
  • args (dict) – arguments
  • kwargs (dict) – arguments
loadmodel(nameprefix)

Abstract method for loadmodel.

Parameters:nameprefix (str) – prefix of the model path
save_compact_model(filename, *args, **kwargs)

Save the model in a compressed binary format.

Parameters:
  • filename (str) – name of the model file
  • args (dict) – arguments
  • kwargs (dict) – arguments
savemodel(nameprefix)

Abstract method for savemodel.

Parameters:nameprefix (str) – prefix of the model path
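
A minimal sketch of a subclass; the class name, suffix, and file contents here are purely illustrative:

>>> from shorttext.utils.compactmodel_io import CompactIOMachine
>>> class ToyModel(CompactIOMachine):
...     def __init__(self):
...         CompactIOMachine.__init__(self, {'classifier': 'toymodel'}, 'toymodel', ['_config.txt'])
...     def savemodel(self, nameprefix):
...         # write the single file that makes up this toy model
...         with open(nameprefix + '_config.txt', 'w') as f:
...             f.write('toy configuration')
...     def loadmodel(self, nameprefix):
...         with open(nameprefix + '_config.txt') as f:
...             self.config = f.read()
>>> model = ToyModel()
>>> model.save_compact_model('./toymodel.bin')
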
shorttext.utils.compactmodel_io.get_model_classifier_name(filename)

Return the name of the classifier from a model file.

Read the file modelconfig.json in the compact model file, and return the name of the classifier.

Parameters:filename (str) – path of the model file
Returns:name of the classifier
Return type:str
shorttext.utils.compactmodel_io.get_model_config_field(filename, parameter)

Return the configuration parameter of a model file.

Read the file modelconfig.json in the compact model file, and return the value of a particular parameter.

Parameters:
  • filename (str) – path of the model file
  • parameter (str) – parameter to look up
Returns:

value of the parameter of this model

Return type:

str

shorttext.utils.compactmodel_io.load_compact_model(filename, loadfunc, prefix, infodict)

Load a model from a compact file that contains multiple files related to the model.

Parameters:
  • filename (str) – name of the model file
  • loadfunc (function) – method or function that performs the loading action. It takes only one argument (str), the prefix of the model files.
  • prefix (str) – prefix of the names of the files
  • infodict (dict) – dictionary that holds information about the model. Must contain the key ‘classifier’.
Returns:

instance of the model

shorttext.utils.compactmodel_io.removedir(dir)

Remove all subdirectories and files under the specified path.

Parameters:dir – path of the directory to be cleaned
Returns:None
shorttext.utils.compactmodel_io.save_compact_model(filename, savefunc, prefix, suffices, infodict)

Save the model in one compact file by zipping all the related files.

Parameters:
  • filename (str) – name of the model file
  • savefunc (function) – method or function that performs the saving action. It takes only one argument (str), the prefix of the model files.
  • prefix (str) – prefix of the names of the files related to the model
  • suffices (list) – list of file suffixes
  • infodict (dict) – dictionary that holds information about the model. Must contain the key ‘classifier’.
Returns:

None
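
Using the hypothetical ToyModel sketched earlier, the function-level API can be exercised directly:

>>> from shorttext.utils import compactmodel_io
>>> model = ToyModel()
>>> compactmodel_io.save_compact_model('./toy.bin', model.savemodel, 'toymodel', ['_config.txt'], {'classifier': 'toymodel'})
>>> def _loadfunc(prefix):
...     m = ToyModel()
...     m.loadmodel(prefix)
...     return m
>>> model2 = compactmodel_io.load_compact_model('./toy.bin', _loadfunc, 'toymodel', {'classifier': 'toymodel'})
>>> compactmodel_io.get_model_classifier_name('./toy.bin')
'toymodel'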

Metrics

Module shorttext.metrics.dynprog

shorttext.metrics.dynprog.jaccard.soft_intersection_list(tokens1, tokens2)

Return the soft number of intersections between two lists of tokens.

Parameters:
  • tokens1 (list) – list of tokens.
  • tokens2 (list) – list of tokens.
Returns:

soft number of intersections.

Return type:

float
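
For example (the returned value is a float; exact matches count fully, near matches fractionally):

>>> from shorttext.metrics.dynprog.jaccard import soft_intersection_list
>>> soft_intersection_list(['helo', 'world'], ['hello', 'world'])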

Module shorttext.metrics.wasserstein

shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance_linprog(first_sent_tokens, second_sent_tokens, wvmodel, distancefunc=<function euclidean>)

Compute the Word Mover’s distance (WMD) between the two given lists of tokens, and return the result of the underlying linear programming problem.

Using methods of linear programming, calculate the WMD between two lists of words. A word-embedding model has to be provided. The whole scipy.optimize.OptimizeResult object is returned.

Reference: Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).

Parameters:
  • first_sent_tokens (list) – first list of tokens.
  • second_sent_tokens (list) – second list of tokens.
  • wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding models.
  • distancefunc (function) – distance function that takes two numpy ndarray.
Returns:

the whole result of the linear programming problem

Return type:

scipy.optimize.OptimizeResult
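
A sketch, reusing the word-embedding model loaded earlier; the fun attribute of the returned OptimizeResult holds the optimized objective value, i.e. the distance:

>>> from shorttext.metrics.wasserstein.wordmoverdist import word_mover_distance_linprog
>>> result = word_mover_distance_linprog(['president', 'speaks'], ['chief', 'talks'], wvmodel)
>>> wmd = result.fun   # the word mover's distance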

Spell Correction

Module shorttext.spell

class shorttext.spell.basespellcorrector.SpellCorrector

Base class for all spell correctors.

The methods of this class are not implemented; it is an “abstract class” whose subclasses implement them.

correct(word)

Recommend a spell correction for the given word.

Parameters:word (str) – word to be checked
Returns:recommended correction
Return type:str
train(text)

Train the spell corrector with the given corpus.

Parameters:text (str) – training corpus
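
Concrete correctors subclass this base class and implement both methods. A toy sketch, with deliberately naive logic for illustration only:

>>> from shorttext.spell.basespellcorrector import SpellCorrector
>>> class NaiveCorrector(SpellCorrector):
...     def train(self, text):
...         # remember every word seen in the training corpus
...         self.vocab = set(text.split())
...     def correct(self, word):
...         # return the word unchanged if known; otherwise suggest any
...         # known word of the same length as a crude correction
...         if word in self.vocab:
...             return word
...         candidates = [w for w in self.vocab if len(w) == len(word)]
...         return candidates[0] if candidates else word
>>> corrector = NaiveCorrector()
>>> corrector.train('the quick brown fox jumps over the lazy dog')
>>> corrector.correct('quick')
'quick'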
