API¶
APIs not covered in the tutorials are listed here.
Shorttext Models Smart Loading¶
- shorttext.smartload.smartload_compact_model(filename, wvmodel, preprocessor=<function text_preprocessor.<locals>.<lambda>>, vecsize=None)¶
Load the appropriate classifier or model from the binary model file.
The second parameter, wvmodel, can be set to None if no Word2Vec model is needed.
Parameters: - filename (str) – path of the compact model file
- wvmodel (gensim.models.keyedvectors.KeyedVectors) – Word2Vec model
- preprocessor (function) – text preprocessor (Default: shorttext.utils.textpreprocess.standard_text_preprocessor_1)
- vecsize (int) – length of embedded vectors in the model (Default: None, extracted directly from the word-embedding model)
Returns: appropriate classifier or model
Raise: AlgorithmNotExistException
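A minimal usage sketch; the file names below are hypothetical placeholders, and the word-embedding file is the standard Google News Word2Vec binary:
>>> from shorttext.smartload import smartload_compact_model
>>> from gensim.models.keyedvectors import KeyedVectors
>>> # hypothetical paths; substitute your own word-embedding and model files
>>> wvmodel = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
>>> classifier = smartload_compact_model('classifier.bin', wvmodel)
>>> # models that need no word embedding can be loaded with wvmodel=None
>>> topicmodel = smartload_compact_model('topicmodel.bin', None)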
Supervised Classification using Word Embedding¶
Module shorttext.generators.seq2seq.s2skeras¶
- class shorttext.generators.seq2seq.s2skeras.Seq2SeqWithKeras(vecsize, latent_dim)¶
Class implementing sequence-to-sequence (seq2seq) learning with keras.
Reference:
Ilya Sutskever, James Martens, Geoffrey Hinton, “Generating Text with Recurrent Neural Networks,” ICML (2011). [UToronto]
Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). [arXiv]
Francois Chollet, “A ten-minute introduction to sequence-to-sequence learning in Keras,” The Keras Blog. [Keras]
Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O’Reilly]
- compile(optimizer='rmsprop', loss='categorical_crossentropy')¶
Compile the keras model after it has been prepared by running prepare_model().
Parameters: - optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: rmsprop)
- loss (str) – loss function available from keras (Default: 'categorical_crossentropy')
Returns: None
- fit(encoder_input, decoder_input, decoder_output, batch_size=64, epochs=100)¶
Fit the given data to train the sequence-to-sequence (seq2seq) model.
Parameters: - encoder_input (numpy.array) – encoder input, a rank-3 tensor
- decoder_input (numpy.array) – decoder input, a rank-3 tensor
- decoder_output (numpy.array) – decoder output, a rank-3 tensor
- batch_size (int) – batch size (Default: 64)
- epochs (int) – number of epochs (Default: 100)
Returns: None
- loadmodel(prefix)¶
Load a trained model from various files.
To load a compact model, call load_compact_model() instead.
Parameters: prefix (str) – prefix of the file path
Returns: None
- prepare_model()¶
Prepare the keras model.
Returns: None
- savemodel(prefix, final=False)¶
Save the trained models into multiple files.
To save it compactly, call save_compact_model() instead.
If final is set to True, the model cannot be further trained.
If there is no trained model, a ModelNotTrainedException will be thrown.
Parameters: - prefix (str) – prefix of the file path
- final (bool) – whether the model is final (that should not be trained further) (Default: False)
Returns: None
Raise: ModelNotTrainedException
- shorttext.generators.seq2seq.s2skeras.loadSeq2SeqWithKeras(path, compact=True)¶
Load a trained Seq2SeqWithKeras class from file.
Parameters: - path (str) – path of the model file
- compact (bool) – whether it is a compact model (Default: True)
Returns: a Seq2SeqWithKeras instance for sequence-to-sequence inference
Return type: Seq2SeqWithKeras
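A minimal end-to-end sketch. The rank-3 tensors are assumed to have shape (samples, timesteps, vecsize); the random arrays below are illustrative stand-ins for one-hot-encoded sequences:
>>> import numpy as np
>>> from shorttext.generators.seq2seq.s2skeras import Seq2SeqWithKeras, loadSeq2SeqWithKeras
>>> vecsize, latent_dim = 50, 128
>>> encoder_input = np.random.rand(8, 10, vecsize)   # illustrative data only
>>> decoder_input = np.random.rand(8, 10, vecsize)
>>> decoder_output = np.random.rand(8, 10, vecsize)
>>> s2s = Seq2SeqWithKeras(vecsize, latent_dim)
>>> s2s.prepare_model()   # build the underlying keras model
>>> s2s.compile()         # default: rmsprop, categorical_crossentropy
>>> s2s.fit(encoder_input, decoder_input, decoder_output, batch_size=4, epochs=2)
>>> s2s.savemodel('s2s_demo')                             # several files with this prefix
>>> s2s2 = loadSeq2SeqWithKeras('s2s_demo', compact=False)   # reload from the prefix files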
Module shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification¶
- class shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification.VarNNSumEmbeddedVecClassifier(wvmodel, vecsize=None, maxlen=15)¶
This is a wrapper for various neural-network algorithms for supervised short-text categorization. Each class label has a few short sentences; each token is converted to an embedded vector, given by a pre-trained word-embedding model (e.g., the Google Word2Vec model), and each sentence is represented by the normalized sum of these vectors. The type of neural network has to be passed when training, and it has to be of type keras.models.Sequential. The number of outputs of the model has to match the number of class labels in the training data. To perform prediction, the input short sentence is converted to a unit vector in the same way, and the scores are calculated according to the trained neural-network model. Examples of the models can be found in frameworks.
A pre-trained Google Word2Vec model can be downloaded here.
- convert_traindata_embedvecs(classdict)¶
Convert the training text data into an embedded matrix.
Each short sentence is represented by the normalized sum of the embedded vectors of all its words.
Parameters: classdict (dict) – training data
Returns: a tuple, consisting of class labels, the matrix of embedded vectors, and the corresponding outputs
Return type: (list, numpy.ndarray, list)
- loadmodel(nameprefix)¶
Load a trained model from files.
Given the prefix of the file paths, load the model from files with names given by the prefix followed by “_classlabels.txt”, “.json”, and “.h5”.
If neither this method nor train() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.
Parameters: nameprefix (str) – prefix of the file path
Returns: None
- savemodel(nameprefix)¶
Save the trained model into files.
Given the prefix of the file paths, save the model into files with names given by the prefix: one ending in “_classlabels.txt”, one in “.json”, and one in “.h5”. If there is no trained model, a ModelNotTrainedException will be thrown.
Parameters: nameprefix (str) – prefix of the file path
Returns: None
Raise: ModelNotTrainedException
- score(shorttext)¶
Calculate the scores of all the class labels for the given short sentence.
Given a short sentence, calculate the classification scores for all class labels, returned as a dictionary with the class labels as keys and the scores as values. If the short sentence is empty, or if other numerical errors occur, the score will be numpy.nan.
If neither train() nor loadmodel() has been run, a ModelNotTrainedException will be raised.
Parameters: shorttext (str) – a short sentence
Returns: a dictionary with keys being the class labels, and values being the corresponding classification scores
Return type: dict
Raise: ModelNotTrainedException
- shorttext_to_embedvec(shorttext)¶
Convert the short text into an averaged embedded vector representation.
Given a short sentence, it converts all the tokens into embedded vectors according to the given word-embedding model, sums them up, and normalizes the resulting vector. It returns this vector, which represents the short sentence.
Parameters: shorttext (str) – a short sentence
Returns: an embedded vector that represents the short sentence
Return type: numpy.ndarray
- train(classdict, kerasmodel, nb_epoch=10)¶
Train the classifier.
The training data and the corresponding keras model have to be given.
If neither this method nor loadmodel() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.
Parameters: - classdict (dict) – training data
- kerasmodel (keras.models.Sequential) – keras sequential model
- nb_epoch (int) – number of steps / epochs in training
Returns: None
- word_to_embedvec(word)¶
Convert the given word into an embedded vector.
Given a word, return the corresponding embedded vector according to the word-embedding model. If the word is not in the model, a vector of zeros is returned.
Parameters: word (str) – a word
Returns: the corresponding embedded vector
Return type: numpy.ndarray
- shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification.load_varnnsumvec_classifier(wvmodel, name, compact=True, vecsize=None)¶
Load a shorttext.classifiers.VarNNSumEmbeddedVecClassifier instance from file, given the pre-trained word-embedding model.
Parameters: - wvmodel (gensim.models.keyedvectors.KeyedVectors) – Word2Vec model
- name (str) – name (if compact=True) or prefix (if compact=False) of the file path
- compact (bool) – whether the model file is compact (Default: True)
- vecsize (int) – length of embedded vectors in the model (Default: None, extracted directly from the word-embedding model)
Returns: the classifier
Return type: VarNNSumEmbeddedVecClassifier
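A training-and-scoring sketch. It assumes wvmodel is a pre-loaded gensim KeyedVectors instance, and uses shorttext.data.subjectkeywords() (a sample data set shipped with the package) together with DenseWordEmbed from the frameworks module documented below:
>>> import shorttext
>>> from shorttext.classifiers.embed.sumvec.VarNNSumEmbedVecClassification import VarNNSumEmbeddedVecClassifier, load_varnnsumvec_classifier
>>> from shorttext.classifiers.embed.sumvec.frameworks import DenseWordEmbed
>>> trainclassdict = shorttext.data.subjectkeywords()
>>> classifier = VarNNSumEmbeddedVecClassifier(wvmodel)  # wvmodel: pre-loaded KeyedVectors (assumed)
>>> kerasmodel = DenseWordEmbed(len(trainclassdict), vecsize=wvmodel.vector_size)
>>> classifier.train(trainclassdict, kerasmodel)
>>> classifier.score('linear algebra')    # dict: class label -> score
>>> classifier.savemodel('sumvec_demo')   # writes three files with this prefix
>>> classifier2 = load_varnnsumvec_classifier(wvmodel, 'sumvec_demo', compact=False)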
Neural Networks¶
Module shorttext.classifiers.embed.sumvec.frameworks¶
- shorttext.classifiers.embed.sumvec.frameworks.DenseWordEmbed(nb_labels, dense_nb_nodes=[], dense_actfcn=[], vecsize=300, reg_coef=0.1, final_activiation='softmax', optimizer='adam')¶
Return layers of a dense neural network. This assumes the input to be a rank-1 vector.
Parameters: - nb_labels (int) – number of class labels
- dense_nb_nodes (list) – number of nodes in each layer (Default: [])
- dense_actfcn (list) – activation functions for each layer (Default: [])
- vecsize (int) – length of the embedded vectors in the model (Default: 300)
- reg_coef (float) – regularization coefficient (Default: 0.1)
- final_activiation (str) – activation function of the final layer (Default: softmax)
- optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: keras sequential model for dense neural network
Return type: keras.models.Model
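For example, a network with two hidden layers of 16 and 8 nodes for three class labels might be built as follows (the hyperparameters are illustrative):
>>> from shorttext.classifiers.embed.sumvec.frameworks import DenseWordEmbed
>>> kerasmodel = DenseWordEmbed(3, dense_nb_nodes=[16, 8], dense_actfcn=['relu', 'relu'], vecsize=300)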
Utilities¶
Module shorttext.utils.kerasmodel_io¶
- shorttext.utils.kerasmodel_io.load_model(nameprefix)¶
Load a keras sequential model from files.
Given the prefix of the file paths, load a keras sequential model from a JSON file and an HDF5 file.
Parameters: nameprefix (str) – Prefix of the paths of the model files
Returns: keras sequential model
Return type: keras.models.Model
- shorttext.utils.kerasmodel_io.save_model(nameprefix, model)¶
Save a keras sequential model into files.
Given a keras sequential model, save the model with the given file path prefix. It saves the model into a JSON file, and an HDF5 file (.h5).
Parameters: - nameprefix (str) – Prefix of the paths of the model files
- model (keras.models.Model) – keras sequential model to be saved
Returns: None
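A round-trip sketch, assuming kerasmodel is an already-built keras sequential model:
>>> from shorttext.utils.kerasmodel_io import save_model, load_model
>>> save_model('demo_model', kerasmodel)   # writes demo_model.json and demo_model.h5
>>> reloaded = load_model('demo_model')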
Module shorttext.utils.gensim_corpora¶
- shorttext.utils.gensim_corpora.generate_gensim_corpora(classdict, preprocess_and_tokenize=<function <lambda>>)¶
Generate a gensim bag-of-words dictionary and corpus.
Given text data as a dict, with the class labels as keys and lists of short texts as values (the same format output by shorttext.data.data_retrieval), return a gensim dictionary and corpus.
Parameters: - classdict (dict) – text data, a dict with keys being the class labels, and each value is a list of short texts
- preprocess_and_tokenize (function) – preprocessor function that takes a short sentence and returns a list of tokens (Default: shorttext.utils.tokenize)
Returns: a tuple, consisting of a gensim dictionary, a corpus, and a list of class labels
Return type: (gensim.corpora.Dictionary, list, list)
- shorttext.utils.gensim_corpora.load_corpus(prefix)¶
Load a gensim corpus and dictionary.
Parameters: prefix (str) – prefix of the files to load
Returns: corpus and dictionary
Return type: tuple
- shorttext.utils.gensim_corpora.save_corpus(dictionary, corpus, prefix)¶
Save a gensim corpus and dictionary.
Parameters: - dictionary (gensim.corpora.Dictionary) – dictionary to save
- corpus (list) – corpus to save
- prefix (str) – prefix of the files to save
Returns: None
- shorttext.utils.gensim_corpora.tokens_to_fracdict(tokens)¶
Return normalized bag-of-words (BOW) vectors.
Parameters: tokens (list) – list of tokens
Returns: normalized counts of the tokens as a dict
Return type: dict
- shorttext.utils.gensim_corpora.update_corpus_labels(dictionary, corpus, newclassdict, preprocess_and_tokenize=<function <lambda>>)¶
Update the corpus with additional training data.
With the additional training data, the dictionary and corpus are updated.
Parameters: - dictionary (gensim.corpora.Dictionary) – original dictionary
- corpus (list) – original corpus
- newclassdict (dict) – additional training data
- preprocess_and_tokenize (function) – preprocessor function that takes a short sentence and returns a list of tokens (Default: shorttext.utils.tokenize)
Returns: a tuple, consisting of the updated corpus and the new corpus (for updating the model)
Return type: tuple
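A short sketch tying these routines together; the toy classdict is illustrative, and the expected output of tokens_to_fracdict follows from the description above:
>>> from shorttext.utils.gensim_corpora import generate_gensim_corpora, save_corpus, load_corpus, tokens_to_fracdict
>>> classdict = {'math': ['linear algebra', 'topology'], 'physics': ['quantum mechanics']}
>>> dictionary, corpus, classlabels = generate_gensim_corpora(classdict)
>>> save_corpus(dictionary, corpus, 'demo_corpus')     # write the files
>>> corpus2, dictionary2 = load_corpus('demo_corpus')  # read them back
>>> tokens_to_fracdict(['apple', 'banana', 'apple'])   # expected: {'apple': 2/3, 'banana': 1/3}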
Module shorttext.utils.compactmodel_io¶
This module contains general routines to zip all model files into one compact file, so that the model can be copied or transferred conveniently.
The methods and decorators in this module are called by other parts of the package; developers are not recommended to call them directly.
- class shorttext.utils.compactmodel_io.CompactIOMachine(infodict, prefix, suffices)¶
Base class that implements compact model I/O.
This is to replace the original compactio() decorator.
- get_info()¶
Get information about the dressed machine.
Returns: dictionary of information about the dressed machine
Return type: dict
- load_compact_model(filename, *args, **kwargs)¶
Load the model in a compressed binary format.
Parameters: - filename (str) – name of the model file
- args (dict) – arguments
- kwargs (dict) – arguments
- loadmodel(nameprefix)¶
Abstract method for loadmodel.
Parameters: nameprefix (str) – prefix of the model path
- save_compact_model(filename, *args, **kwargs)¶
Save the model in a compressed binary format.
Parameters: - filename (str) – name of the model file
- args (dict) – arguments
- kwargs (dict) – arguments
- savemodel(nameprefix)¶
Abstract method for savemodel.
Parameters: nameprefix (str) – prefix of the model path
- shorttext.utils.compactmodel_io.get_model_classifier_name(filename)¶
Return the name of the classifier from a model file.
Read the file modelconfig.json in the compact model file, and return the name of the classifier.
Parameters: filename (str) – path of the model file
Returns: name of the classifier
Return type: str
- shorttext.utils.compactmodel_io.get_model_config_field(filename, parameter)¶
Return a configuration parameter of a model file.
Read the file modelconfig.json in the compact model file, and return the value of a particular parameter.
Parameters: - filename (str) – path of the model file
- parameter (str) – parameter to look up
Returns: value of the parameter of this model
Return type: str
- shorttext.utils.compactmodel_io.load_compact_model(filename, loadfunc, prefix, infodict)¶
Load a model from a compact file that contains multiple files related to the model.
Parameters: - filename (str) – name of the model file
- loadfunc (function) – method or function that performs the loading; it takes exactly one argument (str), the prefix of the model files.
- prefix (str) – prefix of the names of the files
- infodict (dict) – dictionary that holds information about the model. Must contain the key ‘classifier’.
Returns: instance of the model
- shorttext.utils.compactmodel_io.removedir(dir)¶
Remove all subdirectories and files under the specified path.
Parameters: dir – path of the directory to be cleaned
Returns: None
- shorttext.utils.compactmodel_io.save_compact_model(filename, savefunc, prefix, suffices, infodict)¶
Save the model in one compact file by zipping all the related files.
Parameters: - filename (str) – name of the model file
- savefunc (function) – method or function that performs the saving; it takes exactly one argument (str), the prefix of the model files.
- prefix (str) – prefix of the names of the files related to the model
- suffices (list) – list of suffices
- infodict (dict) – dictionary that holds information about the model. Must contain the key ‘classifier’.
Returns: None
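A minimal sketch of how save_compact_model() and load_compact_model() pair up; my_savefunc and my_loadfunc are hypothetical callables that write and read the prefix-based files:
>>> from shorttext.utils.compactmodel_io import save_compact_model, load_compact_model
>>> infodict = {'classifier': 'demo_classifier'}   # the 'classifier' key is required
>>> # my_savefunc(prefix) writes demo.json and demo.h5 (hypothetical)
>>> save_compact_model('demo.bin', my_savefunc, 'demo', ['.json', '.h5'], infodict)
>>> # my_loadfunc(prefix) reads them back and returns the model instance (hypothetical)
>>> model = load_compact_model('demo.bin', my_loadfunc, 'demo', infodict)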
Metrics¶
Module shorttext.metrics.dynprog¶
- shorttext.metrics.dynprog.jaccard.soft_intersection_list(tokens1, tokens2)¶
Return the soft number of intersections between two lists of tokens.
Parameters: - tokens1 (list) – list of tokens.
- tokens2 (list) – list of tokens.
Returns: soft number of intersections.
Return type: float
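For example (a sketch; the exact value depends on the underlying token-similarity measure):
>>> from shorttext.metrics.dynprog.jaccard import soft_intersection_list
>>> soft_intersection_list(['apple', 'orange'], ['aple', 'orange'])   # a float; similar tokens count fractionally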
Module shorttext.metrics.wasserstein¶
- shorttext.metrics.wasserstein.wordmoverdist.word_mover_distance_linprog(first_sent_tokens, second_sent_tokens, wvmodel, distancefunc=<function euclidean>)¶
Compute the Word Mover’s Distance (WMD) between the two given lists of tokens, and return the whole linear-programming result.
Using methods of linear programming, calculate the WMD between two lists of words. A word-embedding model has to be provided. The whole scipy.optimize.OptimizeResult object is returned.
Reference: Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, “From Word Embeddings to Document Distances,” ICML (2015).
Parameters: - first_sent_tokens (list) – first list of tokens.
- second_sent_tokens (list) – second list of tokens.
- wvmodel (gensim.models.keyedvectors.KeyedVectors) – word-embedding model.
- distancefunc (function) – distance function that takes two numpy ndarrays.
Returns: the whole result of the linear programming problem
Return type: scipy.optimize.OptimizeResult
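A usage sketch, assuming wvmodel is a pre-loaded gensim KeyedVectors instance:
>>> from shorttext.metrics.wasserstein.wordmoverdist import word_mover_distance_linprog
>>> result = word_mover_distance_linprog(['president', 'speaks'], ['chief', 'talks'], wvmodel)
>>> result.fun   # the optimal objective value is the word mover's distance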
Spell Correction¶
Module shorttext.spell¶
- class shorttext.spell.basespellcorrector.SpellCorrector¶
Base class for all spell correctors.
This class is not implemented; this is an “abstract class.”
- correct(word)¶
Recommend a spelling correction for the given word.
Parameters: word (str) – word to be checked
Returns: recommended correction
Return type: str
- train(text)¶
Train the spell corrector with the given corpus.
Parameters: text (str) – training corpus
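A usage sketch with a concrete subclass; NorvigSpellCorrector is assumed here to be a shorttext implementation of this interface:
>>> from shorttext.spell.norvig import NorvigSpellCorrector   # assumed concrete subclass
>>> corrector = NorvigSpellCorrector()
>>> corrector.train('the quick brown fox jumps over the lazy dog')
>>> corrector.correct('quik')   # a recommended correction, e.g. 'quick'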