Deep Neural Networks with Word Embedding¶
Wrapper for Neural Networks for Word-Embedding Vectors¶
In this package, the class shorttext.classifiers.VarNNEmbeddedVecClassifier serves as a wrapper for various neural network algorithms for supervised short text categorization.
Each class label has a few short sentences, where each token is converted
to an embedded vector, given by a pre-trained word-embedding model (e.g., the Google Word2Vec model).
Each sentence is then represented by a matrix, i.e., a rank-2 array.
The type of neural network has to be passed when training, and it has to be of
type keras.models.Sequential. The number of outputs of the model has to match
the number of class labels in the training data.
To perform prediction, the input short sentence is converted in the same way,
and the scores are calculated according to the trained neural network model.
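The conversion can be illustrated with a minimal sketch (this is not the package's internal code; sentence_to_matrix is a hypothetical helper, and wvmodel is assumed to be a gensim KeyedVectors instance):
>>> import numpy as np
>>> def sentence_to_matrix(sentence, wvmodel, maxlen=15, vecsize=300):
...     # tokens beyond maxlen are truncated; out-of-vocabulary tokens stay as zero vectors
...     matrix = np.zeros((maxlen, vecsize))
...     for i, token in enumerate(sentence.split()[:maxlen]):
...         if token in wvmodel:
...             matrix[i] = wvmodel[token]
...     return matrix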
Some ready-made neural networks, suitable for short text or document classification, can be found in the module shorttext.classifiers.embed.nnlib.frameworks. Of course, users can supply their own neural networks written in keras.
A pre-trained Google Word2Vec model can be downloaded here, and a pre-trained Facebook FastText model can be downloaded here.
See: Word Embedding Models.
Import the package:
>>> import shorttext
To load the Word2Vec model,
>>> wvmodel = shorttext.utils.load_word2vec_model('/path/to/GoogleNews-vectors-negative300.bin.gz')
Then load the training data:
>>> trainclassdict = shorttext.data.subjectkeywords()
Then we choose a neural network; here, we use the ConvNet:
>>> kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(len(trainclassdict.keys()), vecsize=300)
Initialize the classifier:
>>> classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel)

class shorttext.classifiers.embed.nnlib.VarNNEmbedVecClassification.VarNNEmbeddedVecClassifier(wvmodel, vecsize=None, maxlen=15, with_gensim=False)¶
This is a wrapper for various neural network algorithms for supervised short text categorization. Each class label has a few short sentences, where each token is converted to an embedded vector, given by a pre-trained word-embedding model (e.g., the Google Word2Vec model). The sentences are represented by matrices, i.e., rank-2 arrays. The type of neural network has to be passed when training, and it has to be of type keras.models.Sequential. The number of outputs of the model has to match the number of class labels in the training data. To perform prediction, the input short sentence is converted in the same way, and the scores are calculated according to the trained neural network model. Examples of the models can be found in frameworks.
A pre-trained Google Word2Vec model can be downloaded here.

convert_trainingdata_matrix(classdict)¶
Convert the training data into the format that is fed into the neural networks. This is called by train().
Parameters: classdict (dict) – training data
Returns: a tuple of three, containing a list of class labels, the matrix of embedded word vectors, and the corresponding outputs
Return type: (list, numpy.ndarray, list)
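A hedged usage sketch (variable names are illustrative; the expected shape is an assumption following from the class description above, with one matrix per training sentence):
>>> labels, embedvecs, outputs = classifier.convert_trainingdata_matrix(trainclassdict)
>>> embedvecs.shape    # expected: (number of training sentences, maxlen, vecsize)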

loadmodel(nameprefix)¶
Load a trained model from files.
Given the prefix of the file paths, load the model from files with names given by the prefix followed by “_classlabels.txt”, “.json”, and “.h5”. For shorttext>=0.4.0, a file with extension “_config.json” would also be used.
If this has not been run, or a model was not trained by train(), a ModelNotTrainedException will be raised while performing prediction or saving the model.
Parameters: nameprefix (str) – prefix of the file path
Returns: None

savemodel(nameprefix)¶
Save the trained model into files.
Given the prefix of the file paths, save the model into files with names given by the prefix. There will be three files produced: one ending with “_classlabels.txt”, one with “.json”, and one with “.h5”. For shorttext>=0.4.0, another file with extension “_config.json” would be created.
If there is no trained model, a ModelNotTrainedException will be thrown.
Parameters: nameprefix (str) – prefix of the file path
Returns: None
Raise: ModelNotTrainedException
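For example, a trained classifier might be persisted and restored under a common file-path prefix (the path here is illustrative):
>>> classifier.savemodel('/path/to/nnlibvec_convnet')
>>> classifier2 = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel)
>>> classifier2.loadmodel('/path/to/nnlibvec_convnet')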

score(shorttext)¶
Calculate the scores for all the class labels for the given short sentence.
Given a short sentence, calculate the classification scores for all class labels, returned as a dictionary with keys being the class labels and values being the scores. If the short sentence is empty, or if other numerical errors occur, the score will be numpy.nan. If neither train() nor loadmodel() was run, a ModelNotTrainedException will be raised.
Parameters: shorttext (str) – a short sentence
Returns: a dictionary with keys being the class labels, and values being the corresponding classification scores
Return type: dict
Raise: ModelNotTrainedException

shorttext_to_matrix(shorttext)¶
Convert the short text into a matrix of word-embedding representations.
Given a short sentence, it converts all the tokens into embedded vectors according to the given word-embedding model, and puts them into a matrix. If a word is not in the model, that row is filled with zeros.
Parameters: shorttext (str) – a short sentence
Returns: a matrix of embedded vectors that represent all the tokens in the sentence
Return type: numpy.ndarray
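For example, with the default maxlen=15 and a 300-dimensional word-embedding model, one would expect:
>>> mat = classifier.shorttext_to_matrix('artificial intelligence')
>>> mat.shape
(15, 300)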

train(classdict, kerasmodel, nb_epoch=10)¶
Train the classifier.
The training data and the corresponding keras model have to be given.
If this has not been run, or a model was not loaded by loadmodel(), a ModelNotTrainedException will be raised.
Parameters:
classdict (dict) – training data
kerasmodel (keras.models.Sequential) – keras sequential model
nb_epoch (int) – number of steps / epochs in training
Returns: None

word_to_embedvec(word)¶
Convert the given word into an embedded vector.
Given a word, return the corresponding embedded vector according to the word-embedding model. If there is no such word in the model, a vector with zero values is given.
Parameters: word (str) – a word
Returns: the corresponding embedded vector
Return type: numpy.ndarray
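For instance (assuming a 300-dimensional model; the out-of-vocabulary token below is made up):
>>> classifier.word_to_embedvec('physics').shape
(300,)
>>> classifier.word_to_embedvec('zzzxqjv').sum()   # not in the model, so all zeros
0.0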

Then train the classifier:
>>> classifier.train(trainclassdict, kmodel)
Epoch 1/10
45/45 [==============================] - 0s - loss: 1.0578
Epoch 2/10
45/45 [==============================] - 0s - loss: 0.5536
Epoch 3/10
45/45 [==============================] - 0s - loss: 0.3437
Epoch 4/10
45/45 [==============================] - 0s - loss: 0.2282
Epoch 5/10
45/45 [==============================] - 0s - loss: 0.1658
Epoch 6/10
45/45 [==============================] - 0s - loss: 0.1273
Epoch 7/10
45/45 [==============================] - 0s - loss: 0.1052
Epoch 8/10
45/45 [==============================] - 0s - loss: 0.0961
Epoch 9/10
45/45 [==============================] - 0s - loss: 0.0839
Epoch 10/10
45/45 [==============================] - 0s - loss: 0.0743
Then the model is ready for classification, for example:
>>> classifier.score('artificial intelligence')
{'mathematics': 0.57749695, 'physics': 0.33749574, 'theology': 0.085007325}
The trained model can be saved:
>>> classifier.save_compact_model('/path/to/nnlibvec_convnet_subdata.bin')
To load it, enter:
>>> classifier2 = shorttext.classifiers.load_varnnlibvec_classifier(wvmodel, '/path/to/nnlibvec_convnet_subdata.bin')

shorttext.classifiers.embed.nnlib.VarNNEmbedVecClassification.load_varnnlibvec_classifier(wvmodel, name, compact=True, vecsize=None)¶
Load a shorttext.classifiers.VarNNEmbeddedVecClassifier instance from file, given the pre-trained word-embedding model.
Parameters:
wvmodel (gensim.models.keyedvectors.KeyedVectors) – Word2Vec model
name (str) – name (if compact=True) or prefix (if compact=False) of the file path
compact (bool) – whether the model file is compact (Default: True)
vecsize (int) – length of embedded vectors in the model (Default: None, extracted directly from the word-embedding model)
Returns: the classifier
Return type: VarNNEmbeddedVecClassifier
Provided Neural Networks¶
There are three neural networks available in this package for use with
shorttext.classifiers.VarNNEmbeddedVecClassifier,
and they are available in the module shorttext.classifiers.frameworks.

shorttext.classifiers.embed.nnlib.frameworks.CLSTMWordEmbed(nb_labels, wvmodel=None, nb_filters=1200, n_gram=2, maxlen=15, vecsize=300, cnn_dropout=0.0, nb_rnnoutdim=1200, rnn_dropout=0.2, final_activation='softmax', dense_wl2reg=0.0, dense_bl2reg=0.0, optimizer='adam')¶
Returns the C-LSTM neural network for word-embedded vectors.
Reference: Chunting Zhou, Chonglin Sun, Zhiyuan Liu, Francis Lau, “A C-LSTM Neural Network for Text Classification,” arXiv:1511.08630 (2015). [arXiv]
Parameters:
nb_labels (int) – number of class labels
wvmodel (gensim.models.keyedvectors.KeyedVectors) – pre-trained Gensim Word2Vec model
nb_filters (int) – number of filters (Default: 1200)
n_gram (int) – n-gram, or window size of the CNN/ConvNet (Default: 2)
maxlen (int) – maximum number of words in a sentence (Default: 15)
vecsize (int) – length of the embedded vectors in the model (Default: 300)
cnn_dropout (float) – dropout rate for the CNN/ConvNet (Default: 0.0)
nb_rnnoutdim (int) – output dimension of the LSTM networks (Default: 1200)
rnn_dropout (float) – dropout rate for the LSTM (Default: 0.2)
final_activation (str) – activation function. Options: softplus, softsign, relu, tanh, sigmoid, hard_sigmoid, linear. (Default: ‘softmax’)
dense_wl2reg (float) – L2 regularization coefficient (Default: 0.0)
dense_bl2reg (float) – L2 regularization coefficient for bias (Default: 0.0)
optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: keras model for the C-LSTM for word embeddings
Return type: keras.models.Model

shorttext.classifiers.embed.nnlib.frameworks.CNNWordEmbed(nb_labels, wvmodel=None, nb_filters=1200, n_gram=2, maxlen=15, vecsize=300, cnn_dropout=0.0, final_activation='softmax', dense_wl2reg=0.0, dense_bl2reg=0.0, optimizer='adam')¶
Returns the convolutional neural network (CNN/ConvNet) for word-embedded vectors.
Reference: Yoon Kim, “Convolutional Neural Networks for Sentence Classification,” EMNLP 2014, pp. 1746-1751, arXiv:1408.5882 (2014). [arXiv]
Parameters:
nb_labels (int) – number of class labels
wvmodel (gensim.models.keyedvectors.KeyedVectors) – pre-trained Gensim Word2Vec model
nb_filters (int) – number of filters (Default: 1200)
n_gram (int) – n-gram, or window size of the CNN/ConvNet (Default: 2)
maxlen (int) – maximum number of words in a sentence (Default: 15)
vecsize (int) – length of the embedded vectors in the model (Default: 300)
cnn_dropout (float) – dropout rate for the CNN/ConvNet (Default: 0.0)
final_activation (str) – activation function. Options: softplus, softsign, relu, tanh, sigmoid, hard_sigmoid, linear. (Default: ‘softmax’)
dense_wl2reg (float) – L2 regularization coefficient (Default: 0.0)
dense_bl2reg (float) – L2 regularization coefficient for bias (Default: 0.0)
optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: keras model (Sequential or Model) for the CNN/ConvNet for word embeddings
Return type: keras.models.Model

shorttext.classifiers.embed.nnlib.frameworks.DoubleCNNWordEmbed(nb_labels, wvmodel=None, nb_filters_1=1200, nb_filters_2=600, n_gram=2, filter_length_2=10, maxlen=15, vecsize=300, cnn_dropout_1=0.0, cnn_dropout_2=0.0, final_activation='softmax', dense_wl2reg=0.0, dense_bl2reg=0.0, optimizer='adam')¶
Returns the double-layered convolutional neural network (CNN/ConvNet) for word-embedded vectors.
Parameters:
nb_labels (int) – number of class labels
wvmodel (gensim.models.keyedvectors.KeyedVectors) – pre-trained Gensim Word2Vec model
nb_filters_1 (int) – number of filters for the first CNN/ConvNet layer (Default: 1200)
nb_filters_2 (int) – number of filters for the second CNN/ConvNet layer (Default: 600)
n_gram (int) – n-gram, or window size of the first CNN/ConvNet layer (Default: 2)
filter_length_2 (int) – window size of the second CNN/ConvNet layer (Default: 10)
maxlen (int) – maximum number of words in a sentence (Default: 15)
vecsize (int) – length of the embedded vectors in the model (Default: 300)
cnn_dropout_1 (float) – dropout rate for the first CNN/ConvNet layer (Default: 0.0)
cnn_dropout_2 (float) – dropout rate for the second CNN/ConvNet layer (Default: 0.0)
final_activation (str) – activation function. Options: softplus, softsign, relu, tanh, sigmoid, hard_sigmoid, linear. (Default: ‘softmax’)
dense_wl2reg (float) – L2 regularization coefficient (Default: 0.0)
dense_bl2reg (float) – L2 regularization coefficient for bias (Default: 0.0)
optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: keras model for the double-layered CNN/ConvNet for word embeddings
Return type: keras.models.Model
ConvNet (Convolutional Neural Network)¶
This neural network for supervised learning uses a convolutional neural network (ConvNet), as demonstrated in Kim’s paper.
The function in the frameworks returns a keras.models.Sequential or keras.models.Model; its input parameters are listed in the signature above.
The parameter maxlen defines the maximum length of the sentences. If a sentence has fewer than maxlen words, the empty slots are filled with zero vectors.
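In the snippets below, fr is assumed to be an alias for the frameworks module mentioned above:
>>> import shorttext.classifiers.frameworks as fr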
>>> kmodel = fr.CNNWordEmbed(len(trainclassdict.keys()), vecsize=wvmodel.vector_size)
Double ConvNet¶
This neural network is nothing more than two stacked ConvNet layers. The function in the frameworks returns a keras.models.Sequential or keras.models.Model; its input parameters are listed in the signature above.
The parameter maxlen defines the maximum length of the sentences. If a sentence has fewer than maxlen words, the empty slots are filled with zero vectors.
>>> kmodel = fr.DoubleCNNWordEmbed(len(trainclassdict.keys()), vecsize=wvmodel.vector_size)
C-LSTM (Convolutional Long Short-Term Memory)¶
This neural network for supervised learning uses the C-LSTM, according to the paper by Zhou et al. It is a neural network with a ConvNet as the first layer, followed by an LSTM (long short-term memory), a type of recurrent neural network (RNN).
The function in the frameworks returns a keras.models.Sequential or keras.models.Model.
The parameter maxlen defines the maximum length of the sentences. If a sentence has fewer than maxlen words, the empty slots are filled with zero vectors.
>>> kmodel = fr.CLSTMWordEmbed(len(trainclassdict.keys()), vecsize=wvmodel.vector_size)
User-Defined Neural Network¶
Users can define their own neural network for use in the classifier wrapped by
shorttext.classifiers.VarNNEmbeddedVecClassifier
as long as the following criteria are met (see the sketch after this list):
 the input matrix is a numpy.ndarray of shape (maxlen, vecsize), where maxlen is the maximum length of the sentence, and vecsize is the number of dimensions of the embedded vectors; and
 the output is a one-dimensional array, of size equal to the number of classes provided by the training data. The order of the class labels is assumed to be the same as the order of the given training data (stored as a Python dictionary).
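For illustration, a user-defined network meeting these criteria could be sketched as follows (a minimal sketch using standard keras layers; the layer sizes are arbitrary, and this is not one of the provided frameworks):
>>> from keras.models import Sequential
>>> from keras.layers import LSTM, Dense
>>> maxlen, vecsize = 15, 300
>>> kmodel = Sequential()
>>> kmodel.add(LSTM(32, input_shape=(maxlen, vecsize)))                   # consumes the (maxlen, vecsize) sentence matrix
>>> kmodel.add(Dense(len(trainclassdict.keys()), activation='softmax'))   # one output per class label
>>> kmodel.compile(loss='categorical_crossentropy', optimizer='adam')
>>> classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel)
>>> classifier.train(trainclassdict, kmodel)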
Putting the Word2Vec Model as an Input Keras Layer (Deprecated)¶
This functionality has been removed since release 0.5.11, for the following reasons:
 keras changed its code in a way that produces this bug;
 the layer is memory-consuming;
 only Word2Vec is supported; and
 the results are incorrect.
Reference¶
Chunting Zhou, Chonglin Sun, Zhiyuan Liu, Francis Lau, “A C-LSTM Neural Network for Text Classification,” arXiv:1511.08630 (2015). [arXiv]
“CS231n Convolutional Neural Networks for Visual Recognition,” Stanford Online Course. [link]
Nal Kalchbrenner, Edward Grefenstette, Phil Blunsom, “A Convolutional Neural Network for Modelling Sentences,” Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 655-665 (2014). [arXiv]
Tal Perry, “Convolutional Methods for Text,” Medium (2017). [Medium]
Yoon Kim, “Convolutional Neural Networks for Sentence Classification,” EMNLP 2014, pp. 1746-1751, arXiv:1408.5882 (2014). [arXiv]
Zachary C. Lipton, John Berkowitz, “A Critical Review of Recurrent Neural Networks for Sequence Learning,” arXiv:1506.00019 (2015). [arXiv]