Document-Term Matrix
Preparing for the Corpus
We can create and handle a document-term matrix (DTM) with shorttext. We use the dataset of Presidents’ Inaugural Addresses as an example.
>>> import shorttext
>>> usprez = shorttext.data.inaugural()
For our purpose, each president’s address has to be a single document. Enter this:
>>> docids = sorted(usprez.keys())
>>> usprez = [' '.join(usprez[docid]) for docid in docids]
Now the variable usprez is a list of 56 Inaugural Addresses from George Washington (1789) to Barack Obama (2009), with the IDs stored in docids. We apply the standard text preprocessor to produce a list of lists of tokens (a corpus, in gensim terms):
>>> preprocess = shorttext.utils.standard_text_preprocessor_1()
>>> corpus = [preprocess(address).split(' ') for address in usprez]
The variable corpus is now a list of lists of tokens. For example,
>>> corpus[0] # shows all the preprocessed tokens of the first Presidential Inaugural Address
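To see the shape of such a corpus without downloading the dataset, the following minimal sketch builds one from two made-up sentences. The function toy_preprocess is a hypothetical stand-in that only lowercases and strips non-letters; it is not the shorttext preprocessor, whose exact steps are not reproduced here.

```python
import re

def toy_preprocess(text):
    # Hypothetical stand-in for a text preprocessor (an assumption for
    # illustration): lowercase, drop everything except letters and spaces,
    # then collapse repeated spaces.
    text = re.sub(r"[^a-z ]", "", text.lower())
    return re.sub(r" +", " ", text).strip()

# Two made-up documents, not taken from the inaugural addresses.
toy_docs = ["Fellow citizens, we the People!", "The people of the Union."]

# Same pattern as in the tutorial: preprocess, then split on single spaces.
toy_corpus = [toy_preprocess(doc).split(' ') for doc in toy_docs]

print(toy_corpus)
# [['fellow', 'citizens', 'we', 'the', 'people'], ['the', 'people', 'of', 'the', 'union']]
```

Each inner list corresponds to one document, in the same order as the list of document IDs.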
Using Class NumpyDocumentTermMatrix
Note: the old class DocumentTermMatrix has been removed in release 5.0.0.
With the corpus ready in this form, we can create an instance of the NumpyDocumentTermMatrix class for the DTM. Setting tfidf to True applies tf-idf weighting when the instance is created:
>>> dtm = shorttext.utils.NumpyDocumentTermMatrix(corpus, docids, tfidf=True)
- class shorttext.utils.dtm.NumpyDocumentTermMatrix(corpus: list[str] | None = None, docids: list[Any] | None = None, tfidf: bool = False, tokenize_func: callable | None = None)[source]
Bases: CompactIOMachine
Document-term matrix using numpy dict.
Provides an interface for working with document-term matrices with compact model I/O support.
- __init__(corpus: list[str] | None = None, docids: list[Any] | None = None, tfidf: bool = False, tokenize_func: callable | None = None)[source]
Initialize the document-term matrix.
- Args:
corpus: List of documents.
docids: List of document IDs.
tfidf: Whether to apply TF-IDF weighting. Default: False.
tokenize_func: Tokenization function. Default: advanced_text_tokenizer_1.
- generate_dtm(corpus: list[str], docids: list[Any] | None = None, tfidf: bool = False) None[source]
Generate document-term matrix from corpus.
- Args:
corpus: List of documents.
docids: List of document IDs.
tfidf: Whether to apply TF-IDF weighting. Default: False.
- get_termfreq(docid: str, token: str) float[source]
Get term frequency for a document and token.
- Args:
docid: Document ID.
token: Token.
- Returns:
Term frequency.
- get_total_termfreq(token: str) float[source]
Get total frequency of a token across all documents.
- Args:
token: Token.
- Returns:
Total term frequency.
- get_doc_frequency(token) int[source]
Get document frequency of a token.
- Args:
token: Token.
- Returns:
Number of documents containing the token.
- get_token_occurences(token: str) dict[str, float][source]
Get token occurrences across all documents.
- Args:
token: Token.
- Returns:
Dictionary mapping document IDs to term frequencies.
- get_doc_tokens(docid: str) dict[str, float][source]
Get tokens for a specific document.
- Args:
docid: Document ID.
- Returns:
Dictionary mapping tokens to frequencies.
- savemodel(nameprefix: str) None[source]
Save the document-term matrix.
- Args:
nameprefix: Prefix for output file.
- loadmodel(nameprefix: str) Self[source]
Load the document-term matrix.
- Args:
nameprefix: Prefix for input file.
- property docids: list[str]
List of document IDs.
- property tokens: list[str]
List of tokens.
- property nbdocs: int
Number of documents.
- property nbtokens: int
Number of unique tokens.
One can get the document frequency of any token (the number of documents that contain the token) by:
>>> dtm.get_doc_frequency('peopl') # gives 54, the document frequency of the token "peopl"
or the total term frequency (the total number of occurrences of the given token in all documents) by:
>>> dtm.get_total_termfreq('justic') # gives 32.32, the total term frequency of the token "justic"
or the term frequency for a token in a given document by:
>>> dtm.get_termfreq('2009-Obama', 'chang') # gives 0.94
We can also query the occurrences of a particular token across all documents, returned as a dictionary keyed by document ID, by:
>>> dtm.get_token_occurences('god')
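The four queries above can be illustrated on a toy corpus with plain Python. This is a minimal sketch using raw counts in a dictionary of Counter objects; the function names and the toy data are assumptions for illustration and say nothing about how shorttext stores the matrix internally. Because the tutorial builds the DTM with tfidf=True, the values it reports are weighted rather than raw counts, which is presumably why they are not integers.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (made-up data).
toy_corpus = [
    ['peopl', 'justic', 'peopl'],
    ['justic', 'god'],
    ['peopl', 'god', 'god'],
]
toy_docids = ['doc-a', 'doc-b', 'doc-c']

# One Counter of raw token counts per document ID.
counts = {docid: Counter(tokens) for docid, tokens in zip(toy_docids, toy_corpus)}

def doc_frequency(token):
    # Number of documents that contain the token at least once.
    return sum(1 for c in counts.values() if c[token] > 0)

def total_termfreq(token):
    # Sum of the token's counts over all documents.
    return sum(c[token] for c in counts.values())

def termfreq(docid, token):
    # Count of the token in one document (0 if absent).
    return counts[docid][token]

def token_occurrences(token):
    # Map each document ID containing the token to its count there.
    return {docid: c[token] for docid, c in counts.items() if c[token] > 0}

print(doc_frequency('peopl'))      # 2
print(total_termfreq('god'))       # 3
print(termfreq('doc-a', 'peopl'))  # 2
print(token_occurrences('god'))    # {'doc-b': 1, 'doc-c': 2}
```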
To save the DTM, enter:
>>> dtm.save_compact_model('/path/to/whatever.bin')
To load it later, enter:
>>> usprez_dtm2 = shorttext.utils.NumpyDocumentTermMatrix.from_npdict_file('/path/to/whatever.bin')
- shorttext.utils.dtm.generate_npdict_document_term_matrix(corpus: list[str], doc_ids: list[Any], tokenize_func: callable) NumpyNDArrayWrappedDict[source]
Generate document-term matrix as numpy dict.
- Args:
corpus: List of documents.
doc_ids: List of document IDs.
tokenize_func: Tokenization function.
- Returns:
NumpyNDArrayWrappedDict containing the document-term matrix.
- Raises:
UnequalArrayLengthsException: If corpus and doc_ids have different lengths.
- shorttext.utils.dtm.convert_classdict_to_corpus(classdict: dict[str, list[str]], preprocess_func: callable) tuple[list[str], list[str]][source]
Convert class dictionary to corpus and document IDs.
- Args:
classdict: Training data with class labels as keys and texts as values.
preprocess_func: Text preprocessing function.
- Returns:
Tuple of (corpus, doc_ids).
- shorttext.utils.dtm.convert_classdict_to_xy(classdict: dict[str, list[str]], labels2idx: dict[str, int], preprocess_func: callable, tokenize_func: callable) tuple[NumpyNDArrayWrappedDict, Annotated[SparseArray, '2D Array']][source]
Convert class dictionary to feature matrix and labels.
- Args:
classdict: Training data.
labels2idx: Mapping from labels to indices.
preprocess_func: Text preprocessing function.
tokenize_func: Tokenization function.
- Returns:
Tuple of (document-term matrix, label matrix).
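As a rough illustration of the label side of this conversion, the sketch below flattens a class dictionary into a list of texts and a one-hot label matrix. It assumes that each text becomes one row and that labels are one-hot encoded through labels2idx; the real function returns a NumpyNDArrayWrappedDict and a sparse array, so this dense NumPy version is only a conceptual stand-in with made-up data.

```python
import numpy as np

# Made-up training data: class labels as keys, lists of texts as values.
classdict = {
    'greeting': ['hello there', 'good morning'],
    'farewell': ['see you later'],
}
labels2idx = {'greeting': 0, 'farewell': 1}

texts = []
label_rows = []
for label, label_texts in classdict.items():
    for text in label_texts:
        texts.append(text)
        # One-hot row: 1 in the column of this text's label, 0 elsewhere
        # (assumed encoding, for illustration only).
        row = np.zeros(len(labels2idx), dtype=int)
        row[labels2idx[label]] = 1
        label_rows.append(row)

y = np.vstack(label_rows)

print(texts)       # ['hello there', 'good morning', 'see you later']
print(y.tolist())  # [[1, 0], [1, 0], [0, 1]]
```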
- shorttext.utils.dtm.compute_document_frequency(npdtm: NumpyNDArrayWrappedDict) ndarray[tuple[Any, ...], dtype[int32]][source]
Compute document frequency for each token.
- Args:
npdtm: Document-term matrix.
- Returns:
Array of document frequencies for each token.
- shorttext.utils.dtm.compute_tfidf_document_term_matrix(npdtm: NumpyNDArrayWrappedDict, sparse: bool = True) NumpyNDArrayWrappedDict[source]
Compute TF-IDF weighted document-term matrix.
- Args:
npdtm: Document-term matrix.
sparse: Whether to return sparse format. Default: True.
- Returns:
TF-IDF weighted document-term matrix.
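To make the weighting concrete, here is a minimal sketch of one common tf-idf formulation, tf × log(N / df), applied to a small dense count matrix with NumPy. The formula and the toy matrix are assumptions for illustration; the exact weighting and data structures used by shorttext may differ.

```python
import numpy as np

# Rows are documents, columns are tokens; entries are raw counts (made-up).
counts = np.array([
    [2, 1, 0],
    [0, 1, 0],
    [1, 1, 3],
], dtype=float)

nbdocs = counts.shape[0]

# Document frequency: number of documents with a non-zero count per token.
docfreq = np.count_nonzero(counts, axis=0)

# Inverse document frequency under the assumed formulation log(N / df).
idf = np.log(nbdocs / docfreq)

# Broadcasting multiplies every row of counts by the per-token idf vector.
tfidf = counts * idf

print(docfreq.tolist())  # [2, 3, 1]
print(tfidf[:, 1])       # [0. 0. 0.]
```

Under this formulation a token that appears in every document, like the middle column here, gets weight zero, while a token confined to a single document is weighted most heavily.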
- shorttext.utils.dtm.load_numpy_documentmatrixmatrix(filepath: str | PathLike) NumpyDocumentTermMatrix[source]
Deprecated. Use NumpyDocumentTermMatrix.from_npdict_file instead.
Deprecated since version 4.0.1: This will be removed in 5.0.0.
Reference
Christopher Manning, Hinrich Schuetze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999). [MIT Press]
“Document-Term Matrix: Text Mining in R and Python,” Everything About Data Analytics, WordPress (2018). [WordPress]