Text Preprocessing

Standard Preprocessor

When the bag-of-words (BOW) model is used to represent the content, it is essential to specify how the text is preprocessed before it is passed to the trainers or the classifiers.
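
To see why this matters, here is a minimal sketch using only the Python standard library, not shorttext itself. The helper `bag_of_words` is a hypothetical name introduced for illustration. It shows that, without preprocessing, surface variants of the same word become separate entries in the bag.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count whitespace-separated tokens; no preprocessing is applied."""
    return Counter(text.split())


# Three surface forms of the same word are counted as three distinct tokens.
raw = bag_of_words("Crab crab CRAB!")

# After (manual) normalization the counts collapse into a single token.
normalized = bag_of_words("crab crab crab")

print(raw)
print(normalized)
```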

This package provides a standard way of text preprocessing, which goes through the following steps:

  • removing special characters,

  • removing numerals,

  • converting all letters to lowercase,

  • removing stop words, and

  • stemming the words (using the Snowball Porter stemmer).

To do this, load the preprocessor generator:

>>> from shorttext.utils import standard_text_preprocessor_1

Then create the preprocessor, which is itself a function, by calling:

>>> preprocessor1 = standard_text_preprocessor_1()

It is a function that performs the preprocessing steps listed above:

>>> preprocessor1('Maryland Blue Crab')  # output:  'maryland blue crab'
>>> preprocessor1('filing electronic documents and goes home. eat!!!')   # output: 'file electron document goe home eat'
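
The steps above can be sketched in plain Python. This is an illustration only and not the package's implementation: the stop-word set is a tiny hypothetical list (the package uses the NLTK list), and `crude_stem` is a stand-in that only strips a trailing "s", so its output will differ from the Snowball Porter stemmer on most words.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"and", "the", "a", "of", "to", "in"}


def crude_stem(word: str) -> str:
    """Placeholder for a real stemmer: strips one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def sketch_preprocess(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text)  # remove special characters
    text = re.sub(r"\d", "", text)       # remove numerals
    text = text.lower()                  # convert to lowercase
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(crude_stem(t) for t in tokens)             # "stem" each token


print(sketch_preprocess("Maryland Blue Crab 2024!!!"))
print(sketch_preprocess("The documents and 3 files"))
```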

Customized Text Preprocessor

The standard preprocessor is good for many general natural language processing tasks, but some users may want to define their own preprocessors. The preprocessor is used in topic modeling, and it must be a function that takes a string and returns a string.

To build a preprocessor from several steps, provide a pipeline: a list of functions that each take a string and return a string. As an example, let's develop a preprocessor that 1) converts each token to its base form where the lemmatizer finds one, and otherwise keeps it unchanged; 2) converts the text to upper case; and 3) appends the number of characters to each token.

Load the function that generates the preprocessor function:

>>> from shorttext.utils import text_preprocessor

Initialize a WordNet lemmatizer using nltk:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()

Define the pipeline. Functions for each of the steps are:

>>> step1fcn = lambda s: ' '.join([lemmatizer.lemmatize(s1) for s1 in s.split(' ')])
>>> step2fcn = lambda s: s.upper()
>>> step3fcn = lambda s: ' '.join([s1+'-'+str(len(s1)) for s1 in s.split(' ')])

Then the pipeline is:

>>> pipeline = [step1fcn, step2fcn, step3fcn]

The preprocessor function can be generated with the defined pipeline:

>>> preprocessor2 = text_preprocessor(pipeline)

The function preprocessor2 takes a string and returns a string. Some examples are:

>>> preprocessor2('Maryland blue crab in Annapolis')  # output: 'MARYLAND-8 BLUE-4 CRAB-4 IN-2 ANNAPOLIS-9'
>>> preprocessor2('generative adversarial networks')  # output: 'GENERATIVE-10 ADVERSARIAL-11 NETWORK-7'
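
Conceptually, a pipeline-based preprocessor applies each function to the output of the previous one. The following is a minimal sketch of that idea using only the standard library. The name `compose_pipeline` is hypothetical, and the sketch is an assumption about the behavior described here, not the package's actual code.

```python
from functools import reduce
from typing import Callable


def compose_pipeline(pipeline: list[Callable[[str], str]]) -> Callable[[str], str]:
    """Return a function that feeds text through each step in order."""
    def preprocessor(text: str) -> str:
        return reduce(lambda acc, step: step(acc), pipeline, text)
    return preprocessor


def tag_lengths(s: str) -> str:
    """Append '-<length>' to every space-separated token."""
    return " ".join(f"{tok}-{len(tok)}" for tok in s.split(" "))


sketch_preprocessor = compose_pipeline([str.upper, tag_lengths])
print(sketch_preprocessor("blue crab"))
```

Because every step has the same string-to-string signature, steps can be added, removed, or reordered without changing the rest of the code.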

shorttext.utils.textpreprocessing.tokenize(s: str) -> list[str]

Tokenize a string by splitting on whitespace.

Args:

s: Input string to tokenize.

Returns:

List of tokens split by whitespace.

class shorttext.utils.textpreprocessing.StemmerSingleton

Bases: object

Singleton class for Porter stemmer.

Provides a singleton instance of the snowball stemmer for English.

__call__(s: str) -> str

Stem a word using Porter stemmer.

Args:

s: Word to stem.

Returns:

Stemmed word.
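
For readers unfamiliar with the pattern, here is a minimal sketch of a callable singleton. How StemmerSingleton is actually implemented is not shown here, so the class name `SingletonStemmerSketch` and its trivial "stemming" rule are assumptions made purely for illustration.

```python
class SingletonStemmerSketch:
    """Callable class that only ever creates one instance."""

    _instance = None

    def __new__(cls):
        # Create the instance on first use, then keep returning the same one.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __call__(self, word: str) -> str:
        # Stand-in for a real stemmer: strip one trailing 's' from longer words.
        if len(word) > 3 and word.endswith("s"):
            return word[:-1]
        return word


first = SingletonStemmerSketch()
second = SingletonStemmerSketch()
print(first is second)
print(first("crabs"))
```

The benefit of the pattern is that an expensive object, such as a stemmer, is built once and shared by every caller.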

shorttext.utils.textpreprocessing.stemword(s: str) -> str

Stem a word using Porter stemmer.

Args:

s: Word to stem.

Returns:

Stemmed word.

shorttext.utils.textpreprocessing.preprocess_text(text: str, pipeline: list[callable]) -> str

Preprocess text according to a given pipeline.

Applies a sequence of preprocessing functions to the input text. Each function in the pipeline transforms the text (e.g., stemming, lemmatizing, removing punctuation).

Args:

text: Input text to preprocess.

pipeline: List of functions that each transform a text string to another text string.

Returns:

The preprocessed text after applying all pipeline functions.

shorttext.utils.textpreprocessing.tokenize_text(text: str, presplit_pipeline: list[callable], primitize_tokenizer: callable, postsplit_pipeline: list[callable], stopwordsfile: TextIO) -> list[str]

Tokenize text with preprocessing pipelines.

Applies pre-split and post-split pipelines to tokenize text, filtering out stopwords.

Args:

text: Input text to tokenize.

presplit_pipeline: List of functions to apply before tokenization.

primitize_tokenizer: Tokenizer function to split text into tokens.

postsplit_pipeline: List of functions to apply to each token after tokenization.

stopwordsfile: File containing stopwords to filter out.

Returns:

List of tokens after preprocessing and stopword filtering.
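
The flow described for tokenize_text can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `sketch_tokenize_text` is hypothetical, the stop-word file is assumed to hold one word per line, and stop words are assumed to be filtered after the post-split steps. The package's actual ordering and details may differ.

```python
import io
from typing import Callable, TextIO


def sketch_tokenize_text(
    text: str,
    presplit_pipeline: list[Callable[[str], str]],
    tokenizer: Callable[[str], list[str]],
    postsplit_pipeline: list[Callable[[str], str]],
    stopwordsfile: TextIO,
) -> list[str]:
    # 1) Transform the whole text before splitting.
    for step in presplit_pipeline:
        text = step(text)
    # 2) Split the text into tokens.
    tokens = tokenizer(text)
    # 3) Transform each token individually.
    for step in postsplit_pipeline:
        tokens = [step(tok) for tok in tokens]
    # 4) Drop stop words (assumed: one word per line in the file).
    stopwords = {line.strip() for line in stopwordsfile}
    return [tok for tok in tokens if tok not in stopwords]


stopword_file = io.StringIO("the\nof\n")
tokens = sketch_tokenize_text(
    "The Crabs of Maryland",
    [str.lower],               # pre-split: lowercase the text
    str.split,                 # tokenizer: split on whitespace
    [lambda t: t.rstrip("s")], # post-split: crude plural stripping
    stopword_file,
)
print(tokens)
```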

shorttext.utils.textpreprocessing.text_preprocessor(pipeline: list[callable]) -> callable

Create a text preprocessor function from a pipeline.

Returns a function that applies the given pipeline to preprocess text. This is a convenience function that wraps preprocess_text with a fixed pipeline.

Args:

pipeline: List of functions that transform text to text.

Returns:

A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.oldschool_standard_text_preprocessor(stopwordsfile: TextIO) -> callable

Create a standard text preprocessor.

Returns a text preprocessor with the following steps:

  • Remove special characters

  • Remove numerals

  • Convert to lowercase

  • Remove stop words

  • Stem words using Porter stemmer

Args:

stopwordsfile: File object containing stopwords to filter.

Returns:

A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.standard_text_preprocessor_1() -> callable

Create a standard text preprocessor using NLTK stopwords.

Returns a text preprocessor with the following steps:

  • Remove special characters

  • Remove numerals

  • Convert to lowercase

  • Remove stop words (NLTK list)

  • Stem words using Porter stemmer

Returns:

A callable that takes text and returns preprocessed text.

shorttext.utils.textpreprocessing.standard_text_preprocessor_2() -> callable

Create a standard text preprocessor with negation-aware stopwords.

Returns a text preprocessor with the following steps:

  • Remove special characters

  • Remove numerals

  • Convert to lowercase

  • Remove stop words (NLTK list minus negation terms)

  • Stem words using Porter stemmer

Returns:

A callable that takes text and returns preprocessed text.
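
The idea of a negation-aware stop-word list can be illustrated with a set difference. Both word sets below are small hypothetical examples, not the lists the package uses, and `remove_stop_words` is a helper name introduced only for this sketch.

```python
# Hypothetical miniature lists for illustration only.
BASE_STOP_WORDS = {"the", "is", "a", "not", "no", "nor"}
NEGATION_TERMS = {"not", "no", "nor"}

# Negation-aware list: the base list with the negation terms taken out.
NEGATION_AWARE_STOP_WORDS = BASE_STOP_WORDS - NEGATION_TERMS


def remove_stop_words(text: str, stop_words: set[str]) -> str:
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


sentence = "the crab is not fresh"
print(remove_stop_words(sentence, BASE_STOP_WORDS))
print(remove_stop_words(sentence, NEGATION_AWARE_STOP_WORDS))
```

Keeping negation terms preserves the difference between "fresh" and "not fresh", which can matter for tasks such as sentiment classification.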

shorttext.utils.textpreprocessing.advanced_text_tokenizer_1() -> callable

Create an advanced text tokenizer.

Returns a tokenizer function that applies preprocessing steps:

  • Remove special characters

  • Remove numerals

  • Convert to lowercase

  • Stem tokens using Porter stemmer

  • Filter out negation-aware stopwords

Returns:

A callable that takes text and returns a list of tokens.

Tokenization

Users are free to choose any tokenizer they wish. In shorttext, the tokenizer simply splits the text on whitespace, and can be called as follows:

>>> import shorttext.utils
>>> shorttext.utils.tokenize('Maryland blue crab')   # output: ['Maryland', 'blue', 'crab']
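
As an illustration of swapping in a different tokenizer, the sketch below compares whitespace splitting with a simple regular-expression tokenizer. Both functions are hypothetical stand-ins written with the standard library and are not part of shorttext.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Split on runs of whitespace; punctuation stays attached to tokens."""
    return text.split()


def regex_tokenize(text: str) -> list[str]:
    """Keep only runs of ASCII letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


sample = "Maryland blue crab, steamed!"
print(whitespace_tokenize(sample))
print(regex_tokenize(sample))
```

Whitespace splitting is adequate when the text has already been stripped of special characters; a regex tokenizer is one option when it has not.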

