Document-Term Matrix

Preparing the Corpus

We can create and manipulate a document-term matrix (DTM) with shorttext. We use the dataset of Presidents’ Inaugural Addresses as an example.

>>> import shorttext
>>> usprez = shorttext.data.inaugural()

For our purpose, each president’s address must be a single document. Enter this:

>>> docids = sorted(usprez.keys())
>>> usprez = [' '.join(usprez[docid]) for docid in docids]
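The flattening step can be tried without loading the dataset. The following is a minimal sketch on a made-up dictionary. It assumes, as the code above implies, that each key maps to a list of strings; the keys and texts are invented for illustration.

```python
# Toy stand-in for the dictionary returned by shorttext.data.inaugural();
# the keys and sentences below are invented for illustration.
toy_addresses = {
    "1793-Washington": ["Second address, first part.", "Second address, last part."],
    "1789-Washington": ["First address, first part.", "First address, last part."],
}

# Sort the IDs so that the document order is reproducible.
toy_docids = sorted(toy_addresses.keys())

# Join the pieces of each address into one string per document,
# in the same order as toy_docids.
toy_docs = [" ".join(toy_addresses[docid]) for docid in toy_docids]

print(toy_docids)
print(toy_docs[0])
```

Because the list of texts is built by iterating over the sorted IDs, position i in toy_docs always corresponds to toy_docids[i].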

Now the variable usprez is a list of 56 Inaugural Addresses from George Washington (1789) to Barack Obama (2009), with the IDs stored in docids. We apply the standard text preprocessor to produce a list of lists of tokens (or a corpus, in gensim terms):

>>> preprocess = shorttext.utils.standard_text_preprocessor_1()
>>> corpus = [preprocess(address).split(' ') for address in usprez]

The variable corpus is now a list of lists of tokens. For example,

>>> corpus[0]     # shows all the preprocessed tokens of the first Inaugural Address
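To see the shape of such a corpus without shorttext installed, here is a rough sketch with a simplified stand-in preprocessor. The function toy_preprocess is hypothetical and only lowercases and strips non-letters; the real standard_text_preprocessor_1 does more (the stemmed tokens such as "peopl" and "justic" used later on this page suggest stemming, for instance).

```python
import re


def toy_preprocess(text):
    """Lowercase the text and replace every non-letter character with a space.

    Hypothetical stand-in, NOT shorttext's standard_text_preprocessor_1.
    """
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z ]", " ", lowered)
    # Collapse repeated spaces so that split(" ") yields no empty tokens.
    return re.sub(r" +", " ", letters_only).strip()


# Invented example documents.
toy_docs = ["The people, the People!", "Justice for 2 nations."]

# Same pattern as above: preprocess, then split on single spaces.
toy_corpus = [toy_preprocess(doc).split(" ") for doc in toy_docs]
print(toy_corpus)
```

Each inner list holds the tokens of one document, in the same order as the input documents.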

Using Class NumpyDocumentTermMatrix

Note: the old class DocumentTermMatrix was removed in release 5.0.0.

With the corpus ready in this form, we can create a NumpyDocumentTermMatrix instance for the DTM. Setting tfidf to True applies tf-idf weighting when the instance is created:

>>> dtm = shorttext.utils.NumpyDocumentTermMatrix(corpus, docids, tfidf=True)
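To make the structure concrete, here is a minimal pure-Python sketch of a count-based document-term matrix: one row per document, one column per token. It illustrates the concept only and is not how NumpyDocumentTermMatrix stores its data; all names and the toy corpus are invented.

```python
from collections import Counter

# Invented toy corpus: one token list per document, with matching IDs.
toy_docids = ["doc-a", "doc-b"]
toy_corpus = [["peopl", "nation", "peopl"], ["nation", "justic"]]

# One Counter per document: token -> raw count in that document.
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}

# A fixed, sorted vocabulary gives every token a column position.
vocabulary = sorted({token for tokens in toy_corpus for token in tokens})

# Dense matrix: one row per document, one column per vocabulary token.
matrix = [[counts[docid][token] for token in vocabulary] for docid in toy_docids]

print(vocabulary)
for docid, row in zip(toy_docids, matrix):
    print(docid, row)
```

Most entries of a real document-term matrix are zero, which is why sparse or dictionary-based storage is usually preferred over a dense list of lists.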

class shorttext.utils.dtm.NumpyDocumentTermMatrix(corpus: list[str] | None = None, docids: list[Any] | None = None, tfidf: bool = False, tokenize_func: callable | None = None)

Bases: CompactIOMachine

Document-term matrix using numpy dict.

Provides an interface for working with document-term matrices with compact model I/O support.

__init__(corpus: list[str] | None = None, docids: list[Any] | None = None, tfidf: bool = False, tokenize_func: callable | None = None)

Initialize the document-term matrix.

Args:: corpus: List of documents. docids: List of document IDs. tfidf: Whether to apply TF-IDF weighting. Default: False. tokenize_func: Tokenization function. Default: advanced_text_tokenizer_1.

generate_dtm(corpus: list[str], docids: list[Any] | None = None, tfidf: bool = False) → None

Generate document-term matrix from corpus.

Args:: corpus: List of documents. docids: List of document IDs. tfidf: Whether to apply TF-IDF weighting. Default: False.

get_termfreq(docid: str, token: str) → float

Get term frequency for a document and token.

Args:: docid: Document ID. token: Token.
Returns:: Term frequency.

get_total_termfreq(token: str) → float

Get total frequency of a token across all documents.

Args:: token: Token.
Returns:: Total term frequency.

get_doc_frequency(token) → int

Get document frequency of a token.

Args:: token: Token.
Returns:: Number of documents containing the token.

get_token_occurences(token: str) → dict[str, float]

Get token occurrences across all documents.

Args:: token: Token.
Returns:: Dictionary mapping document IDs to term frequencies.

get_doc_tokens(docid: str) → dict[str, float]

Get tokens for a specific document.

Args:: docid: Document ID.
Returns:: Dictionary mapping tokens to frequencies.

savemodel(nameprefix: str) → None

Save the document-term matrix.

Args:: nameprefix: Prefix for output file.

loadmodel(nameprefix: str) → Self

Load the document-term matrix.

Args:: nameprefix: Prefix for input file.

property docids: list[str]: List of document IDs.

property tokens: list[str]: List of tokens.

property nbdocs: int: Number of documents.

property nbtokens: int: Number of unique tokens.

classmethod from_npdict_file(filepath: str | PathLike) → Self

Load a document-term matrix from a compact file.

Args:: filepath: Path to the compact model file.
Returns:: NumpyDocumentTermMatrix instance.

One can get the document frequency of any token (the number of documents that contain the token) by:

>>> dtm.get_doc_frequency('peopl')  # gives 54, the document frequency of the token "peopl"

or the total term frequency (the total number of occurrences of the given token in all documents) by:

>>> dtm.get_total_termfreq('justic')   # gives 32.32, the total term frequency of the token "justic"

or the term frequency for a token in a given document by:

>>> dtm.get_termfreq('2009-Obama', 'chang')    # gives 0.94

We can also query the occurrences of a particular token across all documents, returned as a dictionary, by:

>>> dtm.get_token_occurences('god')
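The four queries above can be mimicked on raw counts with plain Python. This is only a sketch of what each quantity means, under the assumption of unweighted counts; the dtm object above was built with tfidf=True, so it returns weighted values (hence non-integers such as 32.32). The helper names and the toy corpus are invented, and omitting documents with zero occurrences from the dictionary is an assumption.

```python
from collections import Counter

# Invented toy corpus with matching document IDs.
toy_docids = ["doc-a", "doc-b", "doc-c"]
toy_corpus = [
    ["peopl", "nation", "peopl"],
    ["nation", "justic"],
    ["peopl", "god"],
]

# Token counts per document: docid -> Counter(token -> count).
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}


def doc_frequency(token):
    """Number of documents that contain the token at least once."""
    return sum(1 for doc_counts in counts.values() if doc_counts[token] > 0)


def total_termfreq(token):
    """Total number of occurrences of the token over all documents."""
    return sum(doc_counts[token] for doc_counts in counts.values())


def termfreq(docid, token):
    """Number of occurrences of the token in one document."""
    return counts[docid][token]


def token_occurrences(token):
    """Map each document containing the token to its count there."""
    return {docid: c[token] for docid, c in counts.items() if c[token] > 0}


print(doc_frequency("peopl"), total_termfreq("peopl"))
print(termfreq("doc-a", "peopl"), token_occurrences("peopl"))
```

Looking up a missing token in a Counter returns 0 without adding the key, so the helpers are safe to call with tokens that never appear in the corpus.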

To save the matrix, enter:

>>> dtm.save_compact_model('/path/to/whatever.bin')

To load it later, enter:

>>> usprez_dtm2 = shorttext.utils.NumpyDocumentTermMatrix.from_npdict_file('/path/to/whatever.bin')

shorttext.utils.dtm.generate_npdict_document_term_matrix(corpus: list[str], doc_ids: list[Any], tokenize_func: callable) → NumpyNDArrayWrappedDict

Generate document-term matrix as numpy dict.

Args:: corpus: List of documents. doc_ids: List of document IDs. tokenize_func: Tokenization function.
Returns:: NumpyNDArrayWrappedDict containing the document-term matrix.
Raises:: UnequalArrayLengthsException: If corpus and doc_ids have different lengths.

shorttext.utils.dtm.convert_classdict_to_corpus(classdict: dict[str, list[str]], preprocess_func: callable) → tuple[list[str], list[str]]

Convert class dictionary to corpus and document IDs.

Args:: classdict: Training data with class labels as keys and texts as values. preprocess_func: Text preprocessing function.
Returns:: Tuple of (corpus, doc_ids).

shorttext.utils.dtm.convert_classdict_to_xy(classdict: dict[str, list[str]], labels2idx: dict[str, int], preprocess_func: callable, tokenize_func: callable) → tuple[NumpyNDArrayWrappedDict, Annotated[SparseArray, '2D Array']]

Convert class dictionary to feature matrix and labels.

Args:: classdict: Training data. labels2idx: Mapping from labels to indices. preprocess_func: Text preprocessing function. tokenize_func: Tokenization function.
Returns:: Tuple of (document-term matrix, label matrix).

shorttext.utils.dtm.compute_document_frequency(npdtm: NumpyNDArrayWrappedDict) → ndarray[tuple[Any, ...], dtype[int32]]

Compute document frequency for each token.

Args:: npdtm: Document-term matrix.
Returns:: Array of document frequencies for each token.

shorttext.utils.dtm.compute_tfidf_document_term_matrix(npdtm: NumpyNDArrayWrappedDict, sparse: bool = True) → NumpyNDArrayWrappedDict

Compute TF-IDF weighted document-term matrix.

Args:: npdtm: Document-term matrix. sparse: Whether to return sparse format. Default: True.
Returns:: TF-IDF weighted document-term matrix.
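This page does not state the exact weighting formula, so the following sketch uses one common textbook variant, tf × log(N / df), purely as an illustration of how tf-idf reweights a count matrix. It is an assumption and not necessarily what compute_tfidf_document_term_matrix computes; the toy corpus and all names are invented.

```python
import math
from collections import Counter

# Invented toy corpus with matching document IDs.
toy_docids = ["doc-a", "doc-b", "doc-c"]
toy_corpus = [
    ["peopl", "nation", "peopl"],
    ["nation", "justic"],
    ["peopl", "nation"],
]
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}
n_docs = len(toy_docids)

# Document frequency: in how many documents each token appears.
vocabulary = sorted({token for tokens in toy_corpus for token in tokens})
doc_freq = {
    token: sum(1 for doc_counts in counts.values() if token in doc_counts)
    for token in vocabulary
}

# Assumed variant: weight = raw count * log(N / document frequency).
tfidf = {
    docid: {
        token: tf * math.log(n_docs / doc_freq[token])
        for token, tf in doc_counts.items()
    }
    for docid, doc_counts in counts.items()
}

for docid in toy_docids:
    print(docid, tfidf[docid])
```

Under this variant, a token that occurs in every document (here "nation") gets weight 0, while a token confined to a single document (here "justic") gets the largest inverse-document-frequency factor.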