Document-Term Matrix
Preparing for the Corpus
We can create and handle a document-term matrix (DTM) with shorttext, using the dataset of Presidents’ Inaugural Addresses as an example.
>>> import shorttext
>>> usprez = shorttext.data.inaugural()
For our purpose, each president’s address has to become a single document. Enter this:
>>> docids = sorted(usprez.keys())
>>> usprez = [' '.join(usprez[docid]) for docid in docids]
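The two lines above can be tried on a miniature stand-in for the dataset. This is only a sketch: it assumes the data is a dictionary mapping a document ID to a list of sentences (which is what the join implies), and the sentences and the ID '1789-Washington' are made up for illustration.

```python
# Hypothetical miniature of the dataset: document ID -> list of sentences.
speeches = {
    "2009-Obama": ["My fellow citizens.", "I stand here today."],
    "1789-Washington": ["Fellow-Citizens of the Senate.", "Among the vicissitudes."],
}

# Sort the IDs so that documents and IDs stay aligned by position.
docids = sorted(speeches.keys())

# Join the sentences of each address into one string per document.
documents = [" ".join(speeches[docid]) for docid in docids]
```

Because docids and documents are built in the same order, documents[i] always belongs to docids[i].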
Now the variable usprez is a list of 56 Inaugural Addresses from George Washington (1789) to Barack Obama (2009), with the IDs stored in docids. We apply the standard text preprocessor to produce a list of lists of tokens (a corpus, in gensim terms):
>>> preprocess = shorttext.utils.standard_text_preprocessor_1()
>>> corpus = [preprocess(address).split(' ') for address in usprez]
The variable corpus is now a list of lists of tokens. For example,
>>> corpus[0] # shows all the preprocessed tokens of the first Presidential Inaugural Address
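To see the shape of the result without installing anything, here is a rough sketch with a simplified preprocessor. The function toy_preprocess is hypothetical and is not the shorttext preprocessor: it only lowercases and drops non-letter characters, whereas the stemmed tokens used later on this page (such as 'peopl') suggest the real one also stems.

```python
import re

def toy_preprocess(text):
    """Hypothetical stand-in: lowercase, drop non-letters, collapse spaces."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(text.split())

# Made-up sample documents, one string per document.
addresses = ["Fellow-Citizens of the Senate!", "We the People, in 2009."]

# Same pattern as above: preprocess, then split on single spaces.
corpus = [toy_preprocess(address).split(" ") for address in addresses]
```

Collapsing whitespace before splitting matters here, because splitting on a single space would otherwise leave empty-string tokens behind.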
Using Class DocumentTermMatrix
With the corpus ready in this form, we can create an instance of the DocumentTermMatrix class by:
>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids)
class shorttext.utils.dtm.DocumentTermMatrix(corpus, docids=None, tfidf=False)
Document-term matrix for corpus.
This class handles the document-term matrix (DTM). Given a corpus, users can retrieve the term frequency, document frequency, and total term frequency. Weighting by tf-idf can be applied.
generate_dtm(corpus, tfidf=False)
Generate the internal document-term matrix and other peripheral information objects. This is run when the class is instantiated.
Parameters:
- corpus (list) – corpus.
- tfidf (bool) – whether to weight using tf-idf. (Default: False)
Returns: None
generate_dtm_dataframe()
Generate the data frame of the document-term matrix. This was available in shorttext <= 1.0.3; it now raises an exception.
Returns: data frame of the document-term matrix
Return type: pandas.DataFrame
Raises: NotImplementedException
get_doc_frequency(token)
Retrieve the document frequency of the given token.
Compute the document frequency of the given token, i.e., the number of documents in which this token can be found.
Parameters: token (str) – term or token
Returns: document frequency of the given token
Return type: int
get_doc_tokens(docid)
Retrieve the term frequencies of all tokens in the given document.
Compute the term frequencies of all tokens for the given document. If tfidf was set to True when the class was instantiated, it returns the weighted term frequencies.
This method returns a dictionary of term frequencies with the tokens as the keys.
Parameters: docid (any) – document ID
Returns: a dictionary of term frequencies with the tokens as the keys
Return type: dict
get_termfreq(docid, token)
Retrieve the term frequency of a given token in a particular document.
Given a token and a particular document ID, compute the term frequency for this token. If tfidf was set to True when the class was instantiated, it returns the weighted term frequency.
Parameters:
- docid (any) – document ID
- token (str) – term or token
Returns: term frequency or weighted term frequency of the given token in this document (designated by docid)
Return type: numpy.float
get_token_occurences(token)
Retrieve the term frequencies of a given token in all documents.
Compute the term frequencies of the given token for all the documents. If tfidf was set to True when the class was instantiated, it returns the weighted term frequencies.
This method returns a dictionary of term frequencies with the corresponding document IDs as the keys.
Parameters: token (str) – term or token
Returns: a dictionary of term frequencies with the corresponding document IDs as the keys
Return type: dict
get_total_termfreq(token)
Retrieve the total occurrences of the given token.
Compute the total occurrences of the term in all documents. If tfidf was set to True when the class was instantiated, it returns the sum of the weighted term frequencies.
Parameters: token (str) – term or token
Returns: total occurrences of the given token
Return type: numpy.float
loadmodel(prefix)
Load the model.
Parameters: prefix (str) – prefix of the files
Returns: None
savemodel(prefix)
Save the model.
Parameters: prefix (str) – prefix of the files
Returns: None
One can get the document frequency of any token (the number of documents that the given token is in) by:
>>> usprez_dtm.get_doc_frequency('peopl') # gives 54, the document frequency of the token "peopl"
or the total term frequency (the total number of occurrences of the given token in all documents) by:
>>> usprez_dtm.get_total_termfreq('justic') # gives 134.0, the total term frequency of the token "justic"
or the term frequency for a token in a given document by:
>>> usprez_dtm.get_termfreq('2009-Obama', 'chang') # gives 2.0
We can also query the number of occurrences of a particular word in all documents, returned as a dictionary, by:
>>> usprez_dtm.get_token_occurences('god')
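To make the four quantities concrete, the following sketch computes them on a toy corpus with plain Python counters. It does not use shorttext and says nothing about how the class is implemented; the function names, document IDs and tokens are made up, and omitting documents with a zero count from the occurrences dictionary is an assumption.

```python
from collections import Counter

# Toy corpus of already-preprocessed tokens, with hypothetical document IDs.
docids = ["doc-a", "doc-b", "doc-c"]
corpus = [
    ["peopl", "justic", "peopl"],
    ["justic", "chang"],
    ["peopl", "god"],
]

# One Counter of raw term counts per document, keyed by document ID.
counts = {docid: Counter(tokens) for docid, tokens in zip(docids, corpus)}

def doc_frequency(token):
    """Number of documents that contain the token at least once."""
    return sum(1 for c in counts.values() if c[token] > 0)

def total_termfreq(token):
    """Total number of occurrences of the token over all documents."""
    return float(sum(c[token] for c in counts.values()))

def termfreq(docid, token):
    """Number of occurrences of the token in one document."""
    return float(counts[docid][token])

def token_occurrences(token):
    """Per-document counts of the token, skipping documents without it."""
    return {d: float(c[token]) for d, c in counts.items() if c[token] > 0}
```

Note the difference between the first two: 'peopl' appears in two documents but three times in total, so its document frequency is 2 while its total term frequency is 3.0.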
Of course, we can always reweight the counts above (except the document frequency) with tf-idf by setting tfidf to True when creating the instance of the class:
>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=True)
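As a rough illustration of what such reweighting does, the sketch below applies a textbook tf-idf weight, count × ln(N / df), to a toy corpus. This formula is an assumption chosen for illustration; the exact weighting and any normalization that shorttext applies are not stated on this page and may differ.

```python
import math
from collections import Counter

# Toy corpus of already-preprocessed tokens (made up for illustration).
corpus = [
    ["peopl", "justic", "peopl"],
    ["justic", "chang"],
    ["peopl", "god"],
]
n_docs = len(corpus)

# Document frequency: in how many documents each token appears.
# It is computed from the raw corpus and is not affected by the weighting.
df = Counter(token for doc in corpus for token in set(doc))

def tfidf_weights(doc):
    """Textbook weighting (assumed): raw count times ln(N / df)."""
    tf = Counter(doc)
    return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}

weighted = [tfidf_weights(doc) for doc in corpus]
```

Tokens that occur in fewer documents receive a larger weight per occurrence: here a single 'chang' (one document) outweighs two occurrences of 'peopl' (two documents).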
To save the model, enter:
>>> usprez_dtm.save_compact_model('/path/to/whatever.bin')
To load this class later, enter:
>>> usprez_dtm2 = shorttext.utils.load_DocumentTermMatrix('/path/to/whatever.bin')
shorttext.utils.dtm.load_DocumentTermMatrix(filename, compact=True)
Load a presaved document-term matrix (DTM).
Given the file name (if compact is True) or the prefix (if compact is False), return the document-term matrix.
Parameters:
- filename (str) – file name or prefix
- compact (bool) – whether it is a compact model. (Default: True)
Returns: document-term matrix
Return type: DocumentTermMatrix
Reference
Christopher Manning, Hinrich Schuetze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999). [MIT Press]
“Document-Term Matrix: Text Mining in R and Python,” Everything About Data Analytics, WordPress (2018). [WordPress]