# Document-Term Matrix

## Preparing for the Corpus

We can create and handle a document-term matrix (DTM) with shorttext. We use the dataset of the Presidents’ Inaugural Addresses as an example.

>>> import shorttext
>>> usprez = shorttext.data.inaugural()


For our purpose, each president’s address has to be a single document. Enter this:

>>> docids = sorted(usprez.keys())
>>> usprez = [' '.join(usprez[docid]) for docid in docids]
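
The joining step above can be tried on a small stand-in. This is a minimal sketch that assumes the dataset is a dictionary mapping each document ID to a list of text fragments, as the `' '.join(...)` call suggests; the names `toy_addresses`, `toy_docids`, and `toy_docs` and the sample text are made up for illustration.

```python
# Toy stand-in for the structure returned by shorttext.data.inaugural():
# a dict mapping a document ID to a list of text fragments (assumed shape).
toy_addresses = {
    "1793-Washington": ["Fellow citizens,", "I am again called upon."],
    "1789-Washington": ["Fellow-Citizens of the Senate", "and of the House."],
}

# Sort the IDs so that documents and IDs stay aligned by position.
toy_docids = sorted(toy_addresses.keys())

# Join the fragments of each address into one string per document.
toy_docs = [" ".join(toy_addresses[docid]) for docid in toy_docids]

print(toy_docids)
print(toy_docs[0])
```

Sorting the keys first keeps the i-th document and the i-th ID in step, which matters later when the IDs are passed alongside the corpus.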


Now the variable usprez is a list of 56 Inaugural Addresses, from George Washington (1789) to Barack Obama (2009), with the corresponding IDs stored in docids. We apply the standard text preprocessor to produce a list of lists of tokens (a corpus, in gensim terms):

>>> preprocess = shorttext.utils.standard_text_preprocessor_1()
>>> corpus = [preprocess(address).split(' ') for address in usprez]


The variable corpus is now a list of lists of tokens. For example,

>>> corpus[0]     # shows all the preprocessed tokens of the first Presidential Inaugural Address
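
To make the "list of lists of tokens" shape concrete, here is a rough sketch of a preprocessing step written in plain Python. It is not the implementation of `standard_text_preprocessor_1`: the function `toy_preprocess` and the stop-word list `TOY_STOPWORDS` are hypothetical, and the sketch only lowercases, strips non-letters, and drops a few stop words. The stemmed tokens used later on this page (such as "peopl" and "justic") suggest that the real preprocessor also stems words, which this sketch omits.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
TOY_STOPWORDS = {"the", "of", "and", "a", "to", "in"}

def toy_preprocess(text):
    """Lowercase, keep letters and whitespace only, drop toy stop words."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

toy_docs = ["The people of the Union, in 1789!", "Justice and the people."]

# Each document becomes a list of tokens; the corpus is a list of such lists.
toy_corpus = [toy_preprocess(doc).split(" ") for doc in toy_docs]
print(toy_corpus)
```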


## Using Class DocumentTermMatrix

With the corpus ready in this form, we can create an instance of the DocumentTermMatrix class by:

>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids)

class shorttext.utils.dtm.DocumentTermMatrix(corpus, docids=None, tfidf=False)

Document-term matrix for corpus.

This class handles the document-term matrix (DTM). For a given corpus, users can retrieve the term frequency, the document frequency, and the total term frequency. Weighting with tf-idf can be applied.

generate_dtm(corpus, tfidf=False)

Generate the internal document-term matrix and other peripheral information objects. This is run when the class is instantiated.

Parameters: corpus (list) – corpus; tfidf (bool) – whether to weight using tf-idf (default: False)
Returns: None

generate_dtm_dataframe()

Generate the data frame of the document-term matrix. This was available in shorttext <= 1.0.3; it now raises an exception.

Returns: data frame of the document-term matrix
Return type: pandas.DataFrame
Raises: NotImplementedException

get_doc_frequency(token)

Retrieve the document frequency of the given token.

Compute the document frequency of the given token, i.e., the number of documents in which this token can be found.

Parameters: token (str) – term or token
Returns: document frequency of the given token
Return type: int

get_doc_tokens(docid)

Retrieve the term frequencies of all tokens in the given document.

Compute the term frequencies of all tokens for the given document. If tfidf is set to be True while instantiating the class, it returns the weighted term frequencies.

This method returns a dictionary of term frequencies with the tokens as the keys.

Parameters: docid (any) – document ID
Returns: a dictionary of term frequencies with the tokens as the keys
Return type: dict

get_termfreq(docid, token)

Retrieve the term frequency of a given token in a particular document.

Given a token and a particular document ID, compute the term frequency for this token. If tfidf is set to True while instantiating the class, it returns the weighted term frequency.

Parameters: docid (any) – document ID; token (str) – term or token
Returns: term frequency or weighted term frequency of the given token in this document (designated by docid)
Return type: numpy.float

get_token_occurences(token)

Retrieve the term frequencies of a given token in all documents.

Compute the term frequencies of the given token for all the documents. If tfidf is set to be True while instantiating the class, it returns the weighted term frequencies.

This method returns a dictionary of term frequencies with the corresponding document IDs as the keys.

Parameters: token (str) – term or token
Returns: a dictionary of term frequencies with the corresponding document IDs as the keys
Return type: dict

get_total_termfreq(token)

Retrieve the total occurrences of the given token.

Compute the total occurrences of the term in all documents. If tfidf is set to True while instantiating the class, it returns the sum of weighted term frequency.

Parameters: token (str) – term or token
Returns: total occurrences of the given token
Return type: numpy.float

loadmodel(prefix)

Parameters: prefix (str) – prefix of the files
Returns: None

savemodel(prefix)

Save the model.

Parameters: prefix (str) – prefix of the files
Returns: None

One can get the document frequency of any token (the number of documents that contain the given token) by:

>>> usprez_dtm.get_doc_frequency('peopl')  # gives 54, the document frequency of the token "peopl"


or the total term frequency (the total number of occurrences of the given token in all documents) by:

>>> usprez_dtm.get_total_termfreq('justic')   # gives 134.0, the total term frequency of the token "justic"


or the term frequency for a token in a given document by:

>>> usprez_dtm.get_termfreq('2009-Obama', 'chang')    # gives 2.0


We can also query the number of occurrences of a particular word in each document, returned as a dictionary keyed by document ID, by:

>>> usprez_dtm.get_token_occurences('god')
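
The four queries above can be mirrored on a toy corpus in plain Python to show what each number means. This is only an illustrative sketch built on `collections.Counter`; it is not how shorttext stores the matrix internally, and the names `termfreq`, `doc_frequency`, `total_termfreq`, and `token_occurrences` are hypothetical helpers, not library calls.

```python
from collections import Counter

toy_docids = ["doc-a", "doc-b", "doc-c"]
toy_corpus = [
    ["peopl", "justic", "peopl"],
    ["justic", "god"],
    ["peopl", "god", "god"],
]

# One Counter per document: token -> raw count in that document.
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}

def termfreq(docid, token):
    """Count of a token in one document."""
    return float(counts[docid][token])

def doc_frequency(token):
    """Number of documents that contain the token."""
    return sum(1 for c in counts.values() if token in c)

def total_termfreq(token):
    """Count of a token summed over all documents."""
    return float(sum(c[token] for c in counts.values()))

def token_occurrences(token):
    """Per-document counts of a token, keyed by document ID."""
    return {docid: float(c[token]) for docid, c in counts.items() if token in c}

print(doc_frequency("peopl"))      # number of documents containing "peopl"
print(total_termfreq("peopl"))     # occurrences of "peopl" over the whole corpus
print(termfreq("doc-c", "god"))    # occurrences of "god" in one document
print(token_occurrences("god"))    # the same counts, broken down by document
```

In this toy corpus "peopl" appears three times in total but in only two documents, which is the distinction between total term frequency and document frequency.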


We can also reweight the counts above (except the document frequency) with tf-idf by setting tfidf to True when creating the instance of the class:

>>> usprez_dtm = shorttext.utils.DocumentTermMatrix(corpus, docids=docids, tfidf=True)
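
As a rough illustration of what tf-idf reweighting does to a raw count, the sketch below uses a common textbook form, tf × ln(N / df), where N is the number of documents and df is the document frequency. This formula is an assumption made for illustration: the exact weighting and normalization that shorttext applies may differ, and `toy_tfidf` is a hypothetical helper.

```python
import math
from collections import Counter

toy_corpus = {
    "doc-a": ["nation", "peopl", "peopl"],
    "doc-b": ["nation", "justic"],
    "doc-c": ["nation", "god"],
}

def toy_tfidf(docid, token):
    """Textbook tf-idf: raw count times ln(N / document frequency)."""
    tf = Counter(toy_corpus[docid])[token]
    df = sum(1 for tokens in toy_corpus.values() if token in tokens)
    if df == 0:
        return 0.0  # token never appears, so there is nothing to weight
    return tf * math.log(len(toy_corpus) / df)

print(toy_tfidf("doc-a", "peopl"))   # rare token keeps a positive weight
print(toy_tfidf("doc-a", "nation"))  # token found in every document gets weight 0
```

Under this variant, a token that occurs in every document is weighted down to zero, while a token concentrated in a few documents keeps a positive weight. The document frequency itself is a plain count of documents and is not reweighted.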


To save the instance, enter:

>>> usprez_dtm.save_compact_model('/path/to/whatever.bin')


To load it later, enter:

>>> usprez_dtm2 = shorttext.utils.load_DocumentTermMatrix('/path/to/whatever.bin')

shorttext.utils.dtm.load_DocumentTermMatrix(filename, compact=True)