Data Preparation
This package deals with short text. Text passed in for prediction or classification is simply a str or a list of str, but training data must follow a specific format: a Python dict (a hash map) whose keys are class labels and whose values are lists of short texts. The package provides three example datasets.
Example Training Data 1: Subject Keywords
The first example dataset consists of subject keywords and can be loaded with:
>>> import shorttext
>>> trainclassdict = shorttext.data.subjectkeywords()
This returns a dictionary whose keys are the labels and whose values are lists of subject keywords, as below:
{'mathematics': ['linear algebra', 'topology', 'algebra', 'calculus',
'variational calculus', 'functional field', 'real analysis', 'complex analysis',
'differential equation', 'statistics', 'statistical optimization', 'probability',
'stochastic calculus', 'numerical analysis', 'differential geometry'],
'physics': ['renormalization', 'classical mechanics', 'quantum mechanics',
'statistical mechanics', 'functional field', 'path integral',
'quantum field theory', 'electrodynamics', 'condensed matter',
'particle physics', 'topological solitons', 'astrophysics',
'spontaneous symmetry breaking', 'atomic molecular and optical physics',
'quantum chaos'],
'theology': ['divine providence', 'soteriology', 'anthropology', 'pneumatology', 'Christology',
'Holy Trinity', 'eschatology', 'scripture', 'ecclesiology', 'predestination',
'divine degree', 'creedal confessionalism', 'scholasticism', 'prayer', 'eucharist']}
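The training format shown above can be checked with a few lines of plain Python. This is only a sketch: `is_training_dict` is a hypothetical helper written for illustration and is not part of shorttext.

```python
def is_training_dict(data):
    """Return True if data maps str labels to lists of str."""
    if not isinstance(data, dict):
        return False
    for label, texts in data.items():
        # Each key must be a label string and each value a list of texts.
        if not isinstance(label, str) or not isinstance(texts, list):
            return False
        if not all(isinstance(text, str) for text in texts):
            return False
    return True


# A shortened excerpt of the subject keywords data, for illustration.
example = {
    'mathematics': ['linear algebra', 'topology'],
    'physics': ['quantum mechanics', 'path integral'],
}
```

Calling `is_training_dict(example)` returns True, while a dictionary whose values are bare strings rather than lists is rejected.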
- shorttext.data.data_retrieval.retrieve_csvdata_as_dict(filepath: str | PathLike) -> dict[str, list[str]]
Retrieve the training data in a CSV file.
Reads a CSV file where the first column contains class labels and the second column contains text data. Returns a dictionary mapping class labels to lists of short texts.
- Args:
filepath: Path to the CSV training data file.
- Returns:
A dictionary with class labels as keys and lists of short texts as values.
- Reference:
Data format inspired by common text classification benchmarks.
- shorttext.data.data_retrieval.retrieve_jsondata_as_dict(filepath: str | PathLike) -> dict
Retrieve the training data in a JSON file.
Reads a JSON file where class labels are keys and lists of short texts are values. Returns the corresponding dictionary.
- Args:
filepath: Path to the JSON training data file.
- Returns:
A dictionary with class labels as keys and lists of short texts as values.
- shorttext.data.data_retrieval.get_or_download_data(filename: str, origin: str, asbytes: bool = False) -> TextIOWrapper
Retrieve or download a data file.
Checks if the file exists in the user’s home directory under .shorttext. If not present, downloads from the given origin URL.
- Args:
filename: Name of the file to retrieve.
origin: URL to download the file from if not present locally.
asbytes: If True, opens the file in binary mode. Default is False.
- Returns:
A file object (text or binary mode depending on asbytes).
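The check-then-download behaviour described above could be sketched as follows. This is an illustration under stated assumptions, not the package's code: `open_cached` and its `cache_dir` parameter are hypothetical names, and the standard library's `urlretrieve` stands in for whatever download mechanism shorttext actually uses.

```python
from pathlib import Path
from urllib.request import urlretrieve


def open_cached(filename, origin, asbytes=False, cache_dir=None):
    """Open a cached copy of a file, downloading it first if it is missing."""
    # Default location assumed from the description: ~/.shorttext
    if cache_dir is None:
        cache_dir = Path.home() / '.shorttext'
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    target = cache_dir / filename
    if not target.exists():
        # Only reached on the first call; later calls reuse the local copy.
        urlretrieve(origin, str(target))

    return open(target, 'rb' if asbytes else 'r')
```

Because the local copy is reused once it exists, a change at the origin URL is not picked up until the cached file is deleted.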
- shorttext.data.data_retrieval.subjectkeywords() -> dict[str, list[str]]
Return an example dataset of subjects with keywords.
Returns a small example dataset with three subjects and their corresponding keywords, in the training input format.
- Returns:
A dictionary with subject labels as keys and lists of keywords as values.
- shorttext.data.data_retrieval.inaugural() -> dict[str, list[str]]
Return the Inaugural Addresses of US Presidents.
Returns an example dataset containing the Inaugural Addresses of all Presidents of the United States from George Washington to Barack Obama.
Each key is formatted as “year-lastname” and the value is a list of sentences from the address.
- Returns:
A dictionary with president identifiers as keys and lists of sentences as values.
- shorttext.data.data_retrieval.nihreports(txt_col='PROJECT_TITLE', label_col='FUNDING_ICs', sample_size=512)
Return an example dataset sampled from NIH RePORT.
Returns an example dataset from NIH (National Institutes of Health) RePORT (Research Portfolio Online Reporting Tools) website.
- Args:
- txt_col: Column for text data. Options: 'PROJECT_TITLE' or 'ABSTRACT_TEXT'. Default: 'PROJECT_TITLE'.
- label_col: Column for labels. Options: 'FUNDING_ICs' or 'IC_NAME'. Default: 'FUNDING_ICs'.
- sample_size: Number of samples to return. Set to None for all rows. Default: 512.
- Returns:
A dictionary with IC identifiers as keys and lists of text data as values.
- Reference:
https://exporter.nih.gov/ExPORTER_Catalog.aspx
Dataset adapted from the R package textmineR: https://cran.r-project.org/web/packages/textmineR/index.html
- shorttext.data.data_retrieval.merge_cv_dicts(dicts: list[dict[str, list[str]]]) -> dict[str, list[str]]
Merge multiple training data dictionaries.
Combines multiple data dictionaries in the training data format into a single dictionary.
- Args:
dicts: List of dictionaries to merge, each with class labels as keys and lists of texts as values.
- Returns:
A merged dictionary with all class labels and texts combined.
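One way such a merge might work is sketched below. `merge_class_dicts` is a hypothetical stand-in written for illustration; whether the library's merge_cv_dicts concatenates lists in exactly this order is an assumption.

```python
from collections import defaultdict


def merge_class_dicts(dicts):
    """Combine several label -> texts dictionaries into one."""
    merged = defaultdict(list)
    for classdict in dicts:
        for label, texts in classdict.items():
            # Texts under the same label are concatenated in input order.
            merged[label].extend(texts)
    return dict(merged)


part1 = {'mathematics': ['topology']}
part2 = {'mathematics': ['calculus'], 'physics': ['astrophysics']}
merged = merge_class_dicts([part1, part2])
```

Labels that appear in only one input are carried over unchanged, and the input dictionaries themselves are not modified.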
- shorttext.data.data_retrieval.yield_crossvalidation_classdicts(classdict: dict[str, list[str]], nb_partitions: int, shuffle: bool = False) -> Generator[tuple[dict[str, list[str]], dict[str, list[str]]], None, None]
Yield training and test data partitions for cross-validation.
Partitions the training data into multiple sets. Each iteration yields a (test_dict, train_dict) pair where one partition is used as test data and the remaining partitions are combined as training data.
- Args:
- classdict: Training data dictionary with class labels as keys and lists of texts as values.
- nb_partitions: Number of partitions to create.
- shuffle: Whether to shuffle data before partitioning. Default: False.
- Yields:
Tuples of (test_dict, train_dict) for each partition.
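The partitioning scheme described above can be illustrated with a small generator. This is a sketch under stated assumptions: `crossvalidation_pairs` and its `seed` parameter are hypothetical, and the round-robin assignment of texts to partitions is one plausible choice rather than the library's documented behaviour.

```python
import random


def crossvalidation_pairs(classdict, nb_partitions, shuffle=False, seed=None):
    """Yield (test_dict, train_dict) pairs, one per partition."""
    rng = random.Random(seed)
    partitions = [dict() for _ in range(nb_partitions)]
    for label, texts in classdict.items():
        texts = list(texts)  # copy so the caller's data is left untouched
        if shuffle:
            rng.shuffle(texts)
        # Assumption: deal the texts of each label out round-robin.
        for i, part in enumerate(partitions):
            part[label] = texts[i::nb_partitions]

    for i, test_dict in enumerate(partitions):
        train_dict = {}
        for j, part in enumerate(partitions):
            if j == i:
                continue  # the held-out partition is the test set
            for label, texts in part.items():
                train_dict.setdefault(label, []).extend(texts)
        yield test_dict, train_dict


data = {'a': ['1', '2', '3', '4'], 'b': ['5', '6']}
pairs = list(crossvalidation_pairs(data, 2))
```

Each text appears in the test set of exactly one pair and in the training set of all the others, which is the property cross-validation relies on.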
Example Training Data 2: NIH RePORT
The second example dataset is from NIH RePORT (Research Portfolio Online Reporting Tools). The data can be downloaded from its ExPORTER page. The current data in this package was directly adapted from Thomas Jones’ textmineR R package.
Enter:
>>> trainclassdict = shorttext.data.nihreports()
The NIH RePORT data are not included when the package is installed; they are downloaded from the Internet the first time this function is run.
This will output a similar dictionary with FUNDING_ICs (Institutes and Centers in NIH) as the class labels, and PROJECT_TITLE (title of the funded projects) as the short texts under the corresponding labels. This dictionary has 512 projects in total, randomly drawn from the original data.
Other configurations are available through the arguments of nihreports: txt_col selects the text column ('PROJECT_TITLE' or 'ABSTRACT_TEXT'), label_col selects the label column ('FUNDING_ICs' or 'IC_NAME'), and sample_size sets the number of rows sampled (None for all rows).
Example Training Data 3: Inaugural Addresses
This dataset contains the Inaugural Addresses of all the Presidents of the United States, from George Washington to Barack Obama. The data are not included when the package is installed; they are downloaded from the Internet the first time this function is run.
The addresses are publicly available, and I extracted them from the nltk package.
Enter:
>>> trainclassdict = shorttext.data.inaugural()
User-Provided Training Data
Users can provide their own training data. If it is already in JSON format, it can be loaded easily with the json module of the Python standard library, or by calling:
>>> trainclassdict = shorttext.data.retrieve_jsondata_as_dict('/path/to/file.json')
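For comparison, loading the same kind of file with the standard json module alone might look like the sketch below. `load_json_classdict` is a hypothetical helper for illustration, and the top-level type check is an added safeguard rather than something the package is known to perform.

```python
import json


def load_json_classdict(filepath):
    """Load a JSON file that maps class labels to lists of short texts."""
    with open(filepath, encoding='utf-8') as f:
        data = json.load(f)
    # Assumption: the file holds a single JSON object at the top level.
    if not isinstance(data, dict):
        raise ValueError('expected a JSON object mapping labels to lists of texts')
    return data
```

A file containing `{"mathematics": ["topology"], "physics": ["astrophysics"]}` would load as a dictionary in the training format.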
However, if it is in CSV format, it has to obey the following rules:
there is a heading; and
there are at least two columns: first the labels, and second the short text under the labels (everything beyond the second column will be ignored).
An excerpt of this type of data is as follows:
subject,content
mathematics,linear algebra
mathematics,topology
mathematics,algebra
...
physics,spontaneous symmetry breaking
physics,atomic molecular and optical physics
physics,quantum chaos
...
theology,divine providence
theology,soteriology
theology,anthropology
To load this data file, just enter:
>>> trainclassdict = shorttext.data.retrieve_csvdata_as_dict('/path/to/file.csv')
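The CSV rules above can be made concrete with a short reader built on the standard csv module. This is a sketch only: `load_csv_classdict` is a hypothetical helper, and skipping rows with fewer than two columns is an assumption about how malformed rows should be handled.

```python
import csv
from collections import defaultdict


def load_csv_classdict(filepath):
    """Read a label,text CSV file into a label -> texts dictionary."""
    classdict = defaultdict(list)
    with open(filepath, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # the first row is the heading
        for row in reader:
            if len(row) < 2:
                continue  # assumption: silently skip incomplete rows
            label, text = row[0], row[1]  # later columns are ignored
            classdict[label].append(text)
    return dict(classdict)
```

Rows sharing a label are grouped under one key in the order they appear in the file.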