Spell Correctors

This package supports the use of spell correctors, because typos are very common in relatively short text data.

Two types of spell correctors are provided: one described by Peter Norvig (a Bayesian method based on word frequencies and edit distance), and another by Keisuke Sakaguchi and his colleagues (a semi-character-level recurrent neural network).

>>> import shorttext

We use Norvig’s training corpus as an example. To load it:

>>> from urllib.request import urlopen
>>> text = urlopen('https://norvig.com/big.txt').read().decode('utf-8')   # read() returns bytes; train() expects str

To use a spell corrector, instantiate it, train it on a corpus to obtain a correction model, and then call it to correct words.

Norvig

Peter Norvig described a spell corrector based on a Bayesian approach and edit distance. Refer to his essay (see Reference below) for more information.

>>> norvig_corrector = shorttext.spell.NorvigSpellCorrector()
>>> norvig_corrector.train(text)
>>> norvig_corrector.correct('oranhe')   # gives "orange"

class shorttext.spell.norvig.NorvigSpellCorrector

Bases: SpellCorrector

Spell corrector based on Peter Norvig’s algorithm.

Uses word frequency counts to suggest corrections for misspelled words by finding edits that exist in the vocabulary.

Reference:

https://norvig.com/spell-correct.html

__init__()

Initialize the spell corrector.

train(text: str) → None

Train on a text corpus.

Builds a word frequency dictionary from the input text.

Args:

text: Training text corpus.
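
As a rough illustration of what building a word frequency dictionary can look like, here is a minimal standard-library sketch. It follows the tokenization used in Norvig's essay (lowercasing and matching `\w+`); the function names `tokenize` and `build_word_counts` are hypothetical and are not part of shorttext, whose internals may differ.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_word_counts(text: str) -> Counter:
    """Count how often each token occurs in the corpus."""
    return Counter(tokenize(text))


# Tiny made-up corpus, for illustration only.
counts = build_word_counts("The orange cat sat. The cat slept.")
```

A `Counter` returns 0 for unseen words, which is convenient when unknown words are looked up later.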

P(word: str) → float

Compute word probability from the training corpus.

Args:

word: Word to get probability for.

Returns:

Probability of the word appearing in the corpus.
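
One plausible way to compute such a probability is the relative frequency of the word in the corpus, as in Norvig's essay. The sketch below assumes the counts are held in a `collections.Counter`; `word_probability` is a hypothetical helper, not the shorttext method itself.

```python
from collections import Counter


def word_probability(word: str, counts: Counter) -> float:
    """Relative frequency of `word`; 0.0 for unseen words or an empty corpus."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts[word] / total


# Made-up counts, for illustration only.
counts = Counter({"the": 2, "cat": 2, "orange": 1})
p_the = word_probability("the", counts)
p_dog = word_probability("dog", counts)
```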

correct(word: str) → str

Recommend spelling correction for a word.

Args:

word: Word to correct.

Returns:

Most likely correction, or the original word if no better option.
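
The selection step can be sketched as picking the candidate with the highest corpus frequency and falling back to the input when there is nothing to choose from. This is only an assumption about the behaviour described above; `pick_correction` and its arguments are hypothetical names.

```python
from collections import Counter


def pick_correction(word: str, candidates: set[str], counts: Counter) -> str:
    """Return the most frequent candidate, or `word` if there are none."""
    if not candidates:
        return word
    # Ranking by raw count is equivalent to ranking by relative frequency.
    return max(candidates, key=lambda candidate: counts[candidate])


# Made-up counts and candidates, for illustration only.
counts = Counter({"orange": 3, "range": 1})
best = pick_correction("oranhe", {"orange", "range"}, counts)
```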

known(words: list[str]) → set[str]

Filter words found in the training vocabulary.

Args:

words: List of words to check.

Returns:

Subset of words that appear in the training corpus.
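
Filtering against the vocabulary amounts to a set membership test. A minimal sketch, assuming the vocabulary is available as a set of strings (`filter_known` is a hypothetical name):

```python
def filter_known(words: list[str], vocabulary: set[str]) -> set[str]:
    """Keep only the words that occur in the vocabulary."""
    return {word for word in words if word in vocabulary}


# Made-up vocabulary, for illustration only.
vocabulary = {"orange", "range", "apple"}
found = filter_known(["orange", "oranhe", "apple", "apple"], vocabulary)
```

Returning a set removes duplicates, so repeated inputs appear only once in the result.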

candidates(word: str) → Generator[str, None, None]

Generate spelling correction candidates.

Checks exact match, then edits of distance 1 and 2.

Args:

word: Word to find candidates for.

Yields:

Viable correction candidates.
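
The tiered search (exact match, then edit distance 1, then edit distance 2) can be sketched along the lines of Norvig's essay. The version below returns a set rather than a generator and is not shorttext's implementation; the names `edits1`, `edits2` and `find_candidates`, and the restriction to lowercase ASCII letters, are assumptions made for illustration.

```python
import string

LETTERS = string.ascii_lowercase


def edits1(word: str) -> set[str]:
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word: str) -> set[str]:
    """All strings two edits away from `word`."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def find_candidates(word: str, vocabulary: set[str]) -> set[str]:
    """First non-empty tier: exact match, distance 1, distance 2, else the word itself."""
    # `or` short-circuits, so the costly distance-2 set is built only when needed.
    return (
        {word} & vocabulary
        or edits1(word) & vocabulary
        or edits2(word) & vocabulary
        or {word}
    )


# Made-up vocabulary, for illustration only.
vocabulary = {"orange", "range", "apple"}
one_edit = find_candidates("oranhe", vocabulary)
two_edits = find_candidates("oranhhe", vocabulary)
```

Stopping at the first non-empty tier encodes the assumption that a known word at a smaller edit distance is always preferred over one at a larger distance.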

Reference

Peter Norvig, “How to Write a Spelling Corrector.” (2016) [Norvig]