Document-Term Matrix

Preparing for the Corpus

We can create and handle document-term matrix (DTM) with shorttext. Use the dataset of Presidents’ Inaugural Addresses as an example.

>>> import shorttext
>>> usprez = shorttext.data.inaugural()

We have to make each presidents’ address to be one document to achieve our purpose. Enter this:

>>> docids = sorted(usprez.keys())
>>> usprez = [' '.join(usprez[docid]) for docid in docids]

Now the variable usprez is a list of 56 Inaugural Addresses from George Washington (1789) to Barack Obama (2009), with the IDs stored in docids. We apply the standard text preprocessor and produce a list of lists (of tokens) (or a corpus in gensim):

>>> preprocess = shorttext.utils.standard_text_preprocessor_1()
>>> corpus = [preprocess(address).split(' ') for address in usprez]

Then now the variable corpus is a list of lists of tokens. For example,

>>> corpus[0]     # shows all the preprocessed tokens of the first Presidential Inaugural Addresses

Using Class DocumentTermMatrix

With the corpus ready in this form, we can create a DocumentTermMatrix class for DTM by:

>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids)

One can get the document frequency of any token (the number of documents that the given token is in) by:

>>> usprez_dtm.get_doc_frequency('peopl')  # gives 54, the document frequency of the token "peopl"

or the total term frequencies (the total number of occurrences of the given tokens in all documents) by:

>>> usprez_dtm.get_total_termfreq('justic')   # gives 134.0, the total term frequency of the token "justic"

or the term frequency for a token in a given document by:

>>> usprez_dtm.get_termfreq('2009-Obama', 'chang')    # gives 2.0

We can also query the number of occurrences of a particular word of all documents, stored in a dictionary, by:

>>> usprez_dtm.get_token_occurences('god')

Of course, we can always reweigh the counts above (except document frequency) by imposing tf-idf while creating the instance of the class by enforceing tfidf to be True:

>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=True)

To save the class, enter:

>>> usprez_dtm.save_compact_model('/path/to/whatever.bin')

To load this class later, enter:

>>> usprez_dtm2 = shorttext.utils.load_DocumentTermMatrix('/path/to/whatever.bin')

Reference

Christopher Manning, Hinrich Schuetze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999). [MIT Press]

“Document-Term Matrix: Text Mining in R and Python,” Everything About Data Analytics, WordPress (2018). [WordPress]

Home: Homepage of shorttext