Document-Term Matrix

Preparing the Corpus

We can create and work with a document-term matrix (DTM) in shorttext. We use the dataset of U.S. Presidents’ Inaugural Addresses as an example.

>>> import shorttext
>>> usprez = shorttext.data.inaugural()

We need each president’s address to be a single document, so we join the pieces of each address into one string:

>>> docids = sorted(usprez.keys())
>>> usprez = [' '.join(usprez[docid]) for docid in docids]

Now the variable usprez is a list of 56 Inaugural Addresses, from George Washington (1789) to Barack Obama (2009), with the IDs stored in docids. We apply the standard text preprocessor and produce a list of lists of tokens (a corpus, in gensim terms):

>>> preprocess = shorttext.utils.standard_text_preprocessor_1()
>>> corpus = [preprocess(address).split(' ') for address in usprez]

The variable corpus is now a list of lists of tokens. For example,

>>> corpus[0]     # shows all the preprocessed tokens of the first Presidential Inaugural Address
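To make the shapes concrete, here is a minimal sketch of the same steps on a toy dataset, without shorttext. The dictionary `toy_addresses` and the function `simple_preprocess` are hypothetical stand-ins introduced only for illustration. The real preprocessor does more than this stand-in (the queries later in this page use stemmed-looking tokens such as "peopl"), so treat the output as an illustration of the data layout only.

```python
import re

# Hypothetical toy data in the same shape as the inaugural dataset:
# a dict mapping a document ID to a list of strings.
toy_addresses = {
    "1793-Washington": ["Fellow citizens, I am again called upon.", "The oath is taken."],
    "1789-Washington": ["Fellow citizens of the Senate.", "The people decide."],
}

def simple_preprocess(text):
    """Stand-in preprocessor (an assumption): lowercase, keep letters and spaces."""
    text = re.sub(r"[^a-z ]", "", text.lower())
    return re.sub(r" +", " ", text).strip()

# Sort the IDs so that documents and IDs stay aligned, then join each
# document's pieces into a single string.
toy_docids = sorted(toy_addresses.keys())
toy_docs = [" ".join(toy_addresses[docid]) for docid in toy_docids]

# One list of tokens per document, i.e. a list of lists.
toy_corpus = [simple_preprocess(doc).split(" ") for doc in toy_docs]

print(toy_docids)
print(toy_corpus[0])
```

The i-th entry of `toy_corpus` belongs to the i-th entry of `toy_docids`, which is the alignment the DTM relies on when document IDs are passed in.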

Using Class DocumentTermMatrix

With the corpus in this form, we can create an instance of the DocumentTermMatrix class by:

>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids)

class shorttext.utils.dtm.DocumentTermMatrix(corpus, docids=None, tfidf=False)

Document-term matrix for corpus.

This is a class that handles the document-term matrix (DTM). With a given corpus, users can retrieve term frequency, document frequency, and total term frequency. Weighting with tf-idf can be applied.

generate_dtm(corpus, tfidf=False)

Generate the internal document-term matrix and other peripheral information objects. This is run when the class is instantiated.

Parameters:
  • corpus (list) – corpus.
  • tfidf (bool) – whether to weight using tf-idf. (Default: False)
Returns:

None

generate_dtm_dataframe()

Generate the data frame of the document-term matrix. Available in shorttext <= 1.0.3; it now raises an exception.

Returns: data frame of the document-term matrix
Return type: pandas.DataFrame
Raises: NotImplementedException

get_doc_frequency(token)

Retrieve the document frequency of the given token.

Compute the document frequency of the given token, i.e., the number of documents in which this token can be found.

Parameters: token (str) – term or token
Returns: document frequency of the given token
Return type: int

get_doc_tokens(docid)

Retrieve the term frequencies of all tokens in the given document.

Compute the term frequencies of all tokens for the given document. If tfidf is set to be True while instantiating the class, it returns the weighted term frequencies.

This method returns a dictionary of term frequencies with the tokens as the keys.

Parameters: docid (any) – document ID
Returns: a dictionary of term frequencies with the tokens as the keys
Return type: dict

get_termfreq(docid, token)

Retrieve the term frequency of a given token in a particular document.

Given a token and a particular document ID, compute the term frequency for this token. If tfidf is set to True while instantiating the class, it returns the weighted term frequency.

Parameters:
  • docid (any) – document ID
  • token (str) – term or token
Returns:

term frequency or weighted term frequency of the given token in this document (designated by docid)

Return type:

numpy.float

get_token_occurences(token)

Retrieve the term frequencies of a given token in all documents.

Compute the term frequencies of the given token for all the documents. If tfidf is set to be True while instantiating the class, it returns the weighted term frequencies.

This method returns a dictionary of term frequencies with the corresponding document IDs as the keys.

Parameters: token (str) – term or token
Returns: a dictionary of term frequencies with the corresponding document IDs as the keys
Return type: dict

get_total_termfreq(token)

Retrieve the total occurrences of the given token.

Compute the total occurrences of the term in all documents. If tfidf is set to True while instantiating the class, it returns the sum of weighted term frequency.

Parameters: token (str) – term or token
Returns: total occurrences of the given token
Return type: numpy.float

loadmodel(prefix)

Load the model.

Parameters: prefix (str) – prefix of the files
Returns: None

savemodel(prefix)

Save the model.

Parameters: prefix (str) – prefix of the files
Returns: None

One can get the document frequency of any token (the number of documents that the given token is in) by:

>>> usprez_dtm.get_doc_frequency('peopl')  # gives 54, the document frequency of the token "peopl"

or the total term frequency (the total number of occurrences of the given token in all documents) by:

>>> usprez_dtm.get_total_termfreq('justic')   # gives 134.0, the total term frequency of the token "justic"

or the term frequency for a token in a given document by:

>>> usprez_dtm.get_termfreq('2009-Obama', 'chang')    # gives 2.0
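The three quantities queried above can be illustrated with a minimal pure-Python sketch on a toy corpus. This is not how shorttext stores or computes them internally; the names `counts`, `doc_frequency`, `total_termfreq`, and `termfreq` are hypothetical, and the toy tokens are made up.

```python
from collections import Counter

# Hypothetical toy corpus: one list of tokens per document.
toy_docids = ["doc-a", "doc-b", "doc-c"]
toy_corpus = [
    ["peopl", "justic", "peopl"],
    ["justic", "chang"],
    ["chang", "chang", "peopl"],
]

# Term counts per document, keyed by document ID.
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}

def doc_frequency(token):
    """Number of documents that contain the token at least once."""
    return sum(1 for c in counts.values() if c[token] > 0)

def total_termfreq(token):
    """Total number of occurrences of the token over all documents."""
    return sum(c[token] for c in counts.values())

def termfreq(docid, token):
    """Number of occurrences of the token in one document."""
    return counts[docid][token]

print(doc_frequency("peopl"))      # 2
print(total_termfreq("chang"))     # 3
print(termfreq("doc-c", "chang"))  # 2
```

Document frequency counts each document at most once, while total term frequency sums every occurrence, so the two differ whenever a token repeats within a document.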

We can also query the number of occurrences of a particular token in each document, returned as a dictionary, by:

>>> usprez_dtm.get_token_occurences('god')
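As a rough illustration of the shape of such a result, the sketch below builds a dictionary keyed by document ID for a toy corpus. The function `token_occurrences` is a hypothetical helper, not part of shorttext, and leaving out documents in which the token never appears is an assumption of this sketch.

```python
# Hypothetical toy corpus: one list of tokens per document.
toy_docids = ["doc-a", "doc-b", "doc-c"]
toy_corpus = [["god", "bless", "god"], ["nation"], ["god", "nation"]]

def token_occurrences(token, docids, corpus):
    """Map each document ID to the count of `token` in that document.

    Documents without the token are omitted (an assumption of this sketch).
    """
    result = {}
    for docid, tokens in zip(docids, corpus):
        n = tokens.count(token)
        if n > 0:
            result[docid] = n
    return result

print(token_occurrences("god", toy_docids, toy_corpus))  # {'doc-a': 2, 'doc-c': 1}
```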

Of course, we can always reweight the counts above (except document frequency) with tf-idf by setting tfidf to True when creating the instance of the class:

>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=True)
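For intuition about what tf-idf reweighting does, here is a minimal sketch of one common variant, the raw count multiplied by log(N / df). The exact formula and normalization that shorttext applies may differ, so the numbers from this sketch should not be expected to match its output. All names and toy tokens below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one list of tokens per document.
toy_docids = ["doc-a", "doc-b", "doc-c"]
toy_corpus = [
    ["we", "peopl", "justic", "peopl"],
    ["we", "justic", "chang"],
    ["we", "chang", "chang", "peopl", "god"],
]

n_docs = len(toy_corpus)
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}

# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for tokens in toy_corpus for token in set(tokens))

def tfidf(docid, token):
    """Raw count times log(N / df); 0.0 if the token is absent from the document."""
    tf = counts[docid][token]
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / doc_freq[token])

print(tfidf("doc-c", "god"))    # rare token: weight above its raw count of 1
print(tfidf("doc-c", "chang"))  # appears in 2 of 3 documents: raw count of 2 is scaled down
print(tfidf("doc-a", "we"))     # appears in every document: weight 0.0
```

Under this variant, a token that appears in every document gets weight zero, while tokens confined to a few documents are weighted up relative to their raw counts.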

To save the instance, enter:

>>> usprez_dtm.save_compact_model('/path/to/whatever.bin')

To load the saved instance later, enter:

>>> usprez_dtm2 = shorttext.utils.load_DocumentTermMatrix('/path/to/whatever.bin')

shorttext.utils.dtm.load_DocumentTermMatrix(filename, compact=True)

Load presaved Document-Term Matrix (DTM).

Given the file name (if compact is True) or the prefix (if compact is False), return the document-term matrix.

Parameters:
  • filename (str) – file name or prefix
  • compact (bool) – whether it is a compact model. (Default: True)
Returns:

document-term matrix

Return type:

DocumentTermMatrix

Reference

Christopher Manning, Hinrich Schuetze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999). [MIT Press]

“Document-Term Matrix: Text Mining in R and Python,” Everything About Data Analytics, WordPress (2018). [WordPress]
