Maximum Entropy (MaxEnt) Classifier¶
Maxent¶
The maximum entropy (maxent) classifier is a popular text classifier. It parameterizes the model so as to maximize the categorical entropy, subject to the constraint that the resulting probabilities on the training data match the empirical distribution.
The maxent classifier in shorttext is implemented with Keras. The optimization algorithm defaults to the Adam optimizer, although other gradient-based or momentum-based optimizers can be used. Traditional methods such as generalized iterative scaling (GIS) or L-BFGS cannot be used here.
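For reference, the underlying optimization is the standard maxent formulation (a textbook sketch following Berger et al. 1996, not code from the library). The model p(y|x) maximizes the conditional entropy

\max_{p}\; H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)

subject to the constraint that the model's expected feature values match the empirical ones,

\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y) \qquad \text{for all } i.

The solution has the multinomial logistic (softmax) form

p(y \mid x) = \frac{\exp\!\left(\sum_i \lambda_i f_i(x,y)\right)}{\sum_{y'} \exp\!\left(\sum_i \lambda_i f_i(x,y')\right)},

which is why the classifier can be trained as a single softmax layer with gradient-based optimizers.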
To use the maxent classifier, import the package:
>>> import shorttext
>>> from shorttext.classifiers import MaxEntClassifier
Loading NIH reports as an example:
>>> classdict = shorttext.data.nihreports()
The classifier can be instantiated by:
>>> classifier = MaxEntClassifier()
Train the classifier:
>>> classifier.train(classdict, nb_epochs=1000)
After training, it can be used for classification, such as
>>> classifier.score('cancer immunology') # NCI tops the score
>>> classifier.score('children health') # NIAID tops the score
>>> classifier.score('Alzheimer disease and aging') # NIAID tops the score
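Since score() returns a dictionary mapping class labels to scores, the top labels can be extracted with ordinary dictionary operations:
>>> scores = classifier.score('cancer immunology')
>>> sorted(scores.items(), key=lambda item: item[1], reverse=True)[:3]   # three top-scoring labels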
To save the model,
>>> classifier.save_compact_model('/path/to/filename.bin')
To load the model to be a classifier, enter:
>>> classifier2 = shorttext.classifiers.load_maxent_classifier('/path/to/filename.bin')
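The loaded classifier can be used for scoring immediately, for example:
>>> classifier2.score('cancer immunology')   # reproduces the scores of the original classifier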
class shorttext.classifiers.bow.maxent.MaxEntClassification.MaxEntClassifier(preprocessor=<function MaxEntClassifier.<lambda>>)¶
This is a classifier that implements the principle of maximum entropy.
Reference: Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” Computational Linguistics 22(1): 39-72 (1996).
convert_classdict_to_XY(classdict)¶
Convert the training data into sparse matrices for training.
Parameters: classdict (dict) – training data
Returns: a tuple, consisting of sparse matrices for X (the training data) and y (the labels of the training data)
Return type: tuple
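A brief usage sketch (the exact matrix shapes depend on the corpus, and the one-hot encoding of y is an assumption about this implementation):
>>> X, y = classifier.convert_classdict_to_XY(classdict)
>>> X.shape   # (number of training documents, number of token features)
>>> y.shape   # (number of training documents, number of class labels), assumed one-hot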
index_classlabels()¶
Index the class outcome labels.
Index the class outcome labels into integers, for the neural network implementation.
loadmodel(nameprefix)¶
Load a trained model from files.
Given the prefix of the file paths, load the model from the files whose names are given by the prefix followed by “_classlabels.txt”, “.json”, “.h5”, “_labelidx.pkl”, and “_dictionary.dict”.
If neither this method nor train() has been run, a ModelNotTrainedException will be raised when performing prediction or saving the model.
Parameters: nameprefix (str) – prefix of the file path
Returns: None
savemodel(nameprefix)¶
Save the trained model into files.
Given the prefix of the file paths, save the model into files with names given by the prefix. Five files will be produced: one ending with “_classlabels.txt”, one with “.json”, one with “.h5”, one with “_labelidx.pkl”, and one with “_dictionary.dict”.
If there is no trained model, a ModelNotTrainedException will be thrown.
Parameters: nameprefix (str) – prefix of the file path
Returns: None
Raise: ModelNotTrainedException
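For example, with the hypothetical prefix /path/to/nihmodel, savemodel() writes /path/to/nihmodel_classlabels.txt, /path/to/nihmodel.json, and so on; the files can be read back with load_maxent_classifier() by passing compact=False:
>>> classifier.savemodel('/path/to/nihmodel')
>>> classifier3 = shorttext.classifiers.load_maxent_classifier('/path/to/nihmodel', compact=False)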
score(shorttext)¶
Calculate the scores for all the class labels for the given short sentence.
Given a short sentence, calculate the classification scores for all class labels, returned as a dictionary with keys being the class labels and values being the scores. If the short sentence is empty, or if other numerical errors occur, the score will be numpy.nan. If neither train() nor loadmodel() was run, a ModelNotTrainedException will be raised.
Parameters: shorttext (str) – a short sentence
Returns: a dictionary with keys being the class labels, and values being the corresponding classification scores
Return type: dict
Raise: ModelNotTrainedException
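For instance, an empty input does not raise an exception; the returned scores are numpy.nan:
>>> classifier.score('')   # returns a dict whose values are numpy.nan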
shorttext_to_vec(shorttext)¶
Convert the short text into a sparse vector given the dictionary.
According to the dictionary (gensim.corpora.Dictionary), convert the given text into a vector representation, according to the occurrence of tokens.
This function is deprecated for use in training loops because it is too slow to run repeatedly, but it is still used when performing prediction.
Parameters: shorttext (str) – short text to be converted
Returns: sparse vector of the vector representation
Return type: scipy.sparse.dok_matrix
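A usage sketch:
>>> vec = classifier.shorttext_to_vec('cancer immunology')   # scipy.sparse.dok_matrix with nonzero entries at token indices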
train(classdict, nb_epochs=500, l2reg=0.01, bias_l2reg=0.01, optimizer='adam')¶
Train the classifier.
Given the training data, train the classifier.
Parameters:
- classdict (dict) – training data
- nb_epochs (int) – number of epochs (Default: 500)
- l2reg (float) – L2 regularization coefficient (Default: 0.01)
- bias_l2reg (float) – L2 regularization coefficient for bias (Default: 0.01)
- optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: None
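All hyperparameters can be overridden, for example to use stronger regularization with plain SGD:
>>> classifier.train(classdict, nb_epochs=200, l2reg=0.1, bias_l2reg=0.1, optimizer='sgd')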
shorttext.classifiers.bow.maxent.MaxEntClassification.load_maxent_classifier(name, compact=True)¶
Load the maximum entropy classifier from a saved model.
Given the model file (or files), load the maximum entropy classifier.
Parameters:
- name (str) – name of the file if compact is True, or prefix of the file paths if compact is False
- compact (bool) – whether the model file is compact (Default: True)
Returns: maximum entropy classifier
Return type: MaxEntClassifier
shorttext.classifiers.bow.maxent.MaxEntClassification.logistic_framework(nb_features, nb_outputs, l2reg=0.01, bias_l2reg=0.01, optimizer='adam')¶
Construct the neural network of the maximum entropy classifier.
Given the number of features and the number of output labels, return a Keras neural network implementing the maximum entropy (multinomial) classifier.
Parameters:
- nb_features (int) – number of features
- nb_outputs (int) – number of output labels
- l2reg (float) – L2 regularization coefficient (Default: 0.01)
- bias_l2reg (float) – L2 regularization coefficient for bias (Default: 0.01)
- optimizer (str) – optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. (Default: adam)
Returns: Keras sequential model for the maximum entropy classifier
Return type: keras.models.Sequential
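Conceptually, this network is just multinomial logistic regression: one dense softmax layer with L2 penalties. A minimal sketch of an equivalent construction (an illustration assuming the standalone Keras 2 API, not the library's actual source):

from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l2

def maxent_network_sketch(nb_features, nb_outputs,
                          l2reg=0.01, bias_l2reg=0.01, optimizer='adam'):
    # One dense layer with softmax activation: the parametric form of the
    # maximum entropy solution (multinomial logistic regression).
    model = Sequential()
    model.add(Dense(nb_outputs,
                    input_dim=nb_features,
                    activation='softmax',
                    kernel_regularizer=l2(l2reg),      # L2 penalty on the weights
                    bias_regularizer=l2(bias_l2reg)))  # L2 penalty on the bias
    # Minimizing categorical cross-entropy fits the maxent/logistic model.
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    return model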
Reference¶
Adam L. Berger, Stephen A. Della Pietra, Vincent J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” Computational Linguistics 22(1): 39-72 (1996). [ACM]
Daniel E. Russ, Kwan-Yuet Ho, Joanne S. Colt, Karla R. Armenti, Dalsu Baris, Wong-Ho Chow, Faith Davis, Alison Johnson, Mark P. Purdue, Margaret R. Karagas, Kendra Schwartz, Molly Schwenn, Debra T. Silverman, Patricia A. Stewart, Calvin A. Johnson, Melissa C. Friesen, “Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiological studies”, Occup. Environ. Med. 73, 417-424 (2016). [BMJ]
Daniel Russ, Kwan-Yuet Ho, Melissa Friesen, “It Takes a Village To Solve A Problem in Data Science,” Data Science Maryland, presentation at Applied Physics Laboratory (APL), Johns Hopkins University, on June 19, 2017. (2017) [Slideshare]