Maximum Entropy (MaxEnt) Classifier

Maxent

The maximum entropy (maxent) classifier has been a popular text classifier. It parameterizes the model to achieve the maximum categorical entropy, subject to the constraint that the expected feature values under the model equal their empirical averages on the training data.
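
In the notation of the Berger et al. (1996) reference below, the model p(y|x) maximizes the conditional entropy subject to feature-expectation constraints, and the solution takes the log-linear (softmax) form:

H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, \log p(y \mid x),
\qquad \mathbb{E}_{p}[f_i] = \mathbb{E}_{\tilde{p}}[f_i] \ \text{for every feature } f_i,

p(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)}

With bag-of-words features, fitting the weights \lambda_i reduces to multinomial logistic regression, which is how the keras model below (logistic_framework) realizes it.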

The maxent classifier in shorttext is implemented with keras. The optimization algorithm defaults to the Adam optimizer, although other gradient-based or momentum-based optimizers can be used. Traditional methods such as generalized iterative scaling (GIS) or L-BFGS cannot be used here.

To use the maxent classifier, import the package:

>>> import shorttext
>>> from shorttext.classifiers import MaxEntClassifier

Loading NIH reports as an example:

>>> classdict = shorttext.data.nihreports()
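
The training data is simply a dictionary mapping each class label to a list of short texts, so any corpus in that shape can be used. A toy example with made-up labels and texts:

>>> toy_classdict = {'math': ['linear algebra', 'group theory'],
...                  'physics': ['quantum mechanics', 'statistical mechanics']}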

The classifier can be instantiated by:

>>> classifier = MaxEntClassifier()

Train the classifier:

>>> classifier.train(classdict, nb_epochs=1000)

After training, it can be used for classification, for example:

>>> classifier.score('cancer immunology')   # NCI tops the score
>>> classifier.score('children health')     # NIAID tops the score
>>> classifier.score('Alzheimer disease and aging')    # NIAID tops the score
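
Each call to score() returns a dictionary mapping every class label to its score, so the top predictions can be extracted with ordinary dictionary operations:

>>> scores = classifier.score('cancer immunology')
>>> sorted(scores.items(), key=lambda item: item[1], reverse=True)[:3]   # three highest-scoring labels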

To save the model,

>>> classifier.save_compact_model('/path/to/filename.bin')

To load the model back into a classifier, enter:

>>> classifier2 = shorttext.classifiers.load_maxent_classifier('/path/to/filename.bin')
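
The loaded object exposes the same interface as the original classifier, so, for instance:

>>> classifier2.score('cancer immunology')   # same scores as the classifier saved above
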
class shorttext.classifiers.bow.maxent.MaxEntClassification.MaxEntClassifier(preprocessor=<function MaxEntClassifier.<lambda>>)

This is a classifier that implements the principle of maximum entropy.

Reference: Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” Computational Linguistics 22(1): 39-72 (1996).

convert_classdict_to_XY(classdict)

Convert the training data into sparse matrices for training.

Parameters: classdict (dict) – training data
Returns: a tuple, consisting of sparse matrices for X (training data) and y (the labels of the training data)
Return type: tuple
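
This is the conversion applied to the training data when training; a minimal sketch of calling it directly:

>>> X, y = classifier.convert_classdict_to_XY(classdict)   # sparse feature matrix and labels
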
index_classlabels()

Index the class outcome labels.

Index the class outcome labels into integers, for neural network implementation.

loadmodel(nameprefix)

Load a trained model from files.

Given the prefix of the file paths, load the model from files with name given by the prefix followed by “_classlabels.txt”, “.json”, “.h5”, “_labelidx.pkl”, and “_dictionary.dict”.

If neither this method nor train() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.

Parameters: nameprefix (str) – prefix of the file path
Returns: None
savemodel(nameprefix)

Save the trained model into files.

Given the prefix of the file paths, save the model into files with names given by the prefix. Five files will be produced: one with name ending with “_classlabels.txt”, one with “.json”, one with “.h5”, one with “_labelidx.pkl”, and one with “_dictionary.dict”.

If there is no trained model, a ModelNotTrainedException will be thrown.

Parameters: nameprefix (str) – prefix of the file path
Returns: None
Raise: ModelNotTrainedException
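
For example, using an illustrative prefix, savemodel() writes the five files and loadmodel() reads them back:

>>> classifier.savemodel('/path/to/nihdata')   # writes nihdata_classlabels.txt, nihdata.json, nihdata.h5, nihdata_labelidx.pkl, nihdata_dictionary.dict
>>> classifier.loadmodel('/path/to/nihdata')
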
score(shorttext)

Calculate the scores for all the class labels for the given short sentence.

Given a short sentence, calculate the classification scores for all class labels, returned as a dictionary with key being the class labels, and values being the scores. If the short sentence is empty, or if other numerical errors occur, the score will be numpy.nan. If neither train() nor loadmodel() was run, it will raise ModelNotTrainedException.

Parameters: shorttext (str) – a short sentence
Returns: a dictionary with keys being the class labels, and values being the corresponding classification scores
Return type: dict
Raise: ModelNotTrainedException
shorttext_to_vec(shorttext)

Convert the shorttext into a sparse vector given the dictionary.

Using the dictionary (gensim.corpora.Dictionary), convert the given text into a vector representation based on the occurrence of its tokens.

This function is deprecated because it is too slow to run in a loop, but it is still used when performing prediction on a single short text.

Parameters: shorttext (str) – short text to be converted
Returns: sparse vector of the vector representation
Return type: scipy.sparse.dok_matrix
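
A minimal sketch, assuming a trained classifier (the returned object is a scipy.sparse.dok_matrix indexed by the classifier's gensim dictionary):

>>> vec = classifier.shorttext_to_vec('cancer immunology')
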
train(classdict, nb_epochs=500, l2reg=0.01, bias_l2reg=0.01, optimizer='adam')

Train the classifier.

Given the training data, train the classifier.

Parameters:
  • classdict (dict) – training data
  • nb_epochs (int) – number of epochs (Default: 500)
  • l2reg (float) – L2 regularization coefficient (Default: 0.01)
  • bias_l2reg (float) – L2 regularization coefficient for bias (Default: 0.01)
  • optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: None
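
For example, training with non-default hyperparameters (values are illustrative):

>>> classifier.train(classdict, nb_epochs=200, l2reg=0.1, bias_l2reg=0.1, optimizer='rmsprop')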

shorttext.classifiers.bow.maxent.MaxEntClassification.load_maxent_classifier(name, compact=True)

Load the maximum entropy classifier from saved model.

Given the model file(s), load the maximum entropy classifier.

Parameters:
  • name (str) – name of the model file if compact is True, or the prefix of the model files if compact is False
  • compact (bool) – whether the model file is compact (Default: True)
Returns: maximum entropy classifier
Return type: MaxEntClassifier
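
For a model saved with savemodel() into five separate files, pass the prefix and set compact=False:

>>> classifier3 = shorttext.classifiers.load_maxent_classifier('/path/to/nihdata', compact=False)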

shorttext.classifiers.bow.maxent.MaxEntClassification.logistic_framework(nb_features, nb_outputs, l2reg=0.01, bias_l2reg=0.01, optimizer='adam')

Construct the neural network of maximum entropy classifier.

Given the numbers of features and output labels, return a keras neural network implementing a maximum entropy (multinomial) classifier.

Parameters:
  • nb_features (int) – number of features
  • nb_outputs (int) – number of output labels
  • l2reg (float) – L2 regularization coefficient (Default: 0.01)
  • bias_l2reg (float) – L2 regularization coefficient for bias (Default: 0.01)
  • optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: keras sequential model for maximum entropy classifier
Return type: keras.models.Sequential
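
A minimal sketch of constructing the underlying keras model directly; the feature and label counts are illustrative:

>>> from shorttext.classifiers.bow.maxent.MaxEntClassification import logistic_framework
>>> model = logistic_framework(5000, 20)   # 5000 bag-of-words features, 20 output labels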

Reference

Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” Computational Linguistics 22(1): 39-72 (1996). [ACM]

Daniel E. Russ, Kwan-Yuet Ho, Joanne S. Colt, Karla R. Armenti, Dalsu Baris, Wong-Ho Chow, Faith Davis, Alison Johnson, Mark P. Purdue, Margaret R. Karagas, Kendra Schwartz, Molly Schwenn, Debra T. Silverman, Patricia A. Stewart, Calvin A. Johnson, Melissa C. Friesen, “Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiological studies”, Occup. Environ. Med. 73, 417-424 (2016). [BMJ]

Daniel Russ, Kwan-Yuet Ho, Melissa Friesen, “It Takes a Village To Solve A Problem in Data Science,” Data Science Maryland, presentation at Applied Physics Laboratory (APL), Johns Hopkins University, on June 19, 2017. (2017) [Slideshare]
