Text Preprocessing

Standard Preprocessor

When the bag-of-words (BOW) model is used to represent the content, it is essential to specify how the text is preprocessed before it is passed to the trainers or the classifiers.

This package provides a standard way of text preprocessing, which goes through the following steps:

  • removing special characters,
  • removing numerals,
  • converting all alphabets to lower cases,
  • removing stop words, and
  • stemming the words (using Snowball Porter stemmer).

To do this, load the preprocesser generator:

>>> from shorttext.utils import standard_text_preprocessor_1

Then define the preprocessor, a function, by just calling:

>>> preprocessor1 = standard_text_preprocessor_1()
shorttext.utils.textpreprocessing.standard_text_preprocessor_1()

Return a commonly used text preprocessor.

Return a text preprocessor that is commonly used, with the following steps:

  • removing special characters,
  • removing numerals,
  • converting all alphabets to lower cases,
  • removing stop words (NLTK list), and
  • stemming the words (using Porter stemmer).

This function calls oldschool_standard_text_preprocessor().

Returns:a function that preprocesses text according to the pipeline
Return type:function

It is a function that perform the preprocessing in the steps above:

>>> preprocessor1('Maryland Blue Crab')  # output:  'maryland blue crab'
>>> preprocessor1('filing electronic documents and goes home. eat!!!')   # output: 'file electron document goe home eat'

Customized Text Preprocessor

The standard preprocessor is good for many general natural language processing tasks, but some users may want to define their own preprocessors for their own purposes. This preprocessor is used in topic modeling, and is desired to be a function that takes a string, and returns a string.

If the user wants to develop a preprocessor that contains a few steps, he can make it by providing the pipeline, which is a list of functions that input a string and return a string. For example, let’s develop a preprocessor that 1) convert it to base form if it is a verb, or keep it original; 2) convert it to upper case; and 3) tag the number of characters after each token.

Load the function that generates the preprocessor function:

>>> from shorttext.utils import text_preprocessor

Initialize a WordNet lemmatizer using nltk:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()

Define the pipeline. Functions for each of the steps are:

>>> step1fcn = lambda s: ' '.join([lemmatizer.lemmatize(s1) for s1 in s.split(' ')])
>>> step2fcn = lambda s: s.upper()
>>> step3fcn = lambda s: ' '.join([s1+'-'+str(len(s1)) for s1 in s.split(' ')])

Then the pipeline is:

>>> pipeline = [step1fcn, step2fcn, step3fcn]

The preprocessor function can be generated with the defined pipeline:

>>> preprocessor2 = text_preprocessor(pipeline)

The function preprocessor2 is a function that input a string and returns a string. Some examples are:

>>> preprocessor2('Maryland blue crab in Annapolis')  # output: 'MARYLAND-8 BLUE-4 CRAB-4 IN-2 ANNAPOLIS-9'
>>> preprocessor2('generative adversarial networks')  # output: 'GENERATIVE-10 ADVERSARIAL-11 NETWORK-7'
shorttext.utils.textpreprocessing.text_preprocessor(pipeline)

Return the function that preprocesses text according to the pipeline.

Given the pipeline, which is a list of functions that process an input text to another text (e.g., stemming, lemmatizing, removing punctuations etc.), return a function that preprocesses an input text outlined by the pipeline, essentially a function that runs preprocess_text() with the specified pipeline.

Parameters:pipeline (list) – a list of functions that convert a text to another text
Returns:a function that preprocesses text according to the pipeline
Return type:function

Tokenization

Users are free to choose any tokenizer they wish. In shorttext, the tokenizer is simply the space delimiter, and can be called:

>>> shorttext.utils.tokenize('Maryland blue crab')   # output: ['Maryland', 'blue', 'crab']

Reference

Christopher Manning, Hinrich Schuetze, Foundations of Statistical Natural Language Processing (Cambridge, MA: MIT Press, 1999). [MIT Press]

“R or Python on Text Mining,” Everything About Data Analytics, WordPress (2015). [WordPress]

Home: Homepage of shorttext