Spell Correctors

This package supports the use of spell correctors, because typos are very common in relatively short text data.

There are two types of spell correctors provided: the one described by Peter Norvig (a Bayesian method based on word frequencies and edit distance), and another by Keisuke Sakaguchi and his colleagues (a semi-character-level recurrent neural network).

>>> import shorttext

We use Norvig’s training corpus as an example. To load it,

>>> from urllib.request import urlopen
>>> text = urlopen('https://norvig.com/big.txt').read()

The developer just has to instantiate a spell corrector and train it on a corpus to obtain a correction model, which can then be used for correction.

Norvig

Peter Norvig described a spell corrector based on a Bayesian approach and edit distance. You can refer to his blog post for more information.

>>> norvig_corrector = shorttext.spell.NorvigSpellCorrector()
>>> norvig_corrector.train(text)
>>> norvig_corrector.correct('oranhe')   # gives "orange"
class shorttext.spell.norvig.NorvigSpellCorrector

Spell corrector described by Peter Norvig in his blog. (https://norvig.com/spell-correct.html)

P(word)

Compute the probability of the given word, estimated as its relative frequency in the training corpus.

Parameters:word (str) – a word
Returns:probability of the word sampled randomly in the corpus
Return type:float
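The idea behind P(word) can be sketched as a simple relative-frequency estimate. The toy counts below are illustrative only, not the package's internals:

```python
from collections import Counter

# toy word counts standing in for a trained corpus (illustrative only)
WORDS = Counter({'orange': 5, 'the': 20, 'apple': 3})
N = sum(WORDS.values())   # total number of tokens, here 28

def P(word):
    # relative frequency of the word in the training corpus;
    # unseen words get probability 0
    return WORDS[word] / N

P('orange')   # 5/28
```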
candidates(word)

List potential candidates for the corrected spelling of the given word.

Parameters:word (str) – a word
Returns:list of recommended corrections
Return type:list
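Candidate generation in Norvig's approach enumerates strings within a small edit distance of the input. The following sketch of the edit-distance-1 set (deletions, transpositions, replacements, insertions) is illustrative and not the package's actual implementation:

```python
def edits1(word, letters='abcdefghijklmnopqrstuvwxyz'):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

'orange' in edits1('oranhe')   # True: 'h' replaced by 'g'
```

Candidates are then typically restricted to words seen in the training corpus (the role of known()), falling back to edit distance 2 if nothing matches.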
correct(word)

Recommend a spelling correction for the given word.

Parameters:word (str) – a word
Returns:recommended correction
Return type:str
known(words)

Filter away the words that are not found in the training corpus.

Parameters:words (list) – list of words
Returns:list of words that can be found in the training corpus
Return type:list
train(text)

Given the text, train the spell corrector.

Parameters:text (str) – training corpus
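Training essentially amounts to tokenizing the corpus and counting word occurrences. A minimal sketch along the lines of Norvig's write-up (the package's actual tokenization may differ):

```python
import re
from collections import Counter

def words(text):
    # lowercase the text and extract alphabetic tokens
    return re.findall(r'[a-z]+', text.lower())

counts = Counter(words("The rain in Spain stays mainly in the plain."))
counts['in']   # 2
```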

Sakaguchi (SCRNN - semi-character recurrent neural network)

Keisuke Sakaguchi and his colleagues developed this spell corrector with the insight that most typos occur in the interior of a word, while its first and last characters tend to be preserved. They developed a recurrent neural network trained on such perturbed spellings. There are seven modes:

  • JUMBLE-WHOLE
  • JUMBLE-BEG
  • JUMBLE-END
  • JUMBLE-INT
  • NOISE-INSERT
  • NOISE-DELETE
  • NOISE-REPLACE

The original intent of their work was not to invent a new spell corrector but to study the “Cmabrigde Uinervtisy” effect, but it is nice to see how it can be implemented as a spell corrector.
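The semi-character representation behind the scRNN can be sketched as a fixed-length word vector: a one-hot encoding of the first character, a bag-of-characters count of the interior, and a one-hot encoding of the last character. The helper below is a hypothetical illustration (lowercase letters only), not the package's actual encoder:

```python
from string import ascii_lowercase

import numpy as np

ALPH = ascii_lowercase   # illustrative alphabet: lowercase letters only

def semichar_vector(word):
    """Semi-character encoding: one-hot first char, bag-of-characters
    for the interior, one-hot last char."""
    vec = np.zeros(3 * len(ALPH))
    vec[ALPH.index(word[0])] = 1                      # first character
    for ch in word[1:-1]:                             # interior bag
        vec[len(ALPH) + ALPH.index(ch)] += 1
    vec[2 * len(ALPH) + ALPH.index(word[-1])] = 1     # last character
    return vec
```

Because the interior is an unordered bag, a jumbled interior such as “ornage” maps to the same vector as “orange”, which is exactly the “Cmabrigde Uinervtisy” effect the representation exploits.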

>>> scrnn_corrector = shorttext.spell.SCRNNSpellCorrector('JUMBLE-WHOLE')
>>> scrnn_corrector.train(text)
>>> scrnn_corrector.correct('oranhe')   # gives "orange"

We can persist the SCRNN corrector for future use:

>>> scrnn_corrector.save_compact_model('/path/to/spellscrnn.bin')

To load,

>>> corrector = shorttext.spell.loadSCRNNSpellCorrector('/path/to/spellscrnn.bin')
class shorttext.spell.sakaguchi.SCRNNSpellCorrector(operation, alph="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz., :;'*!?`$%&(){}[]-/\@_#", specialsignals={'eos': '#', 'number': '@', 'unk': '_'}, concatcharvec_encoder=None, batchsize=1, nb_hiddenunits=650)

scRNN (semi-character-level recurrent neural network) Spell Corrector.

Reference: Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme, “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Networ,” arXiv:1608.02214 (2016). [arXiv]

correct(word)

Recommend a spelling correction for the given word.

Parameters:word (str) – a given word
Returns:recommended correction
Return type:str
Raise:ModelNotTrainedException
loadmodel(prefix)

Load the model.

Parameters:prefix (str) – prefix of the model path
Returns:None
preprocess_text_correct(text)

A generator that outputs numpy vectors of the text for correction.

ModelNotTrainedException is raised if the model has not been trained.

Parameters:text (str) – text
Returns:generator that outputs the numpy vectors for correction
Return type:generator
Raise:ModelNotTrainedException
preprocess_text_train(text)

A generator that outputs numpy vectors of the text for training.

Parameters:text (str) – text
Returns:generator that outputs the numpy vectors for training
Return type:generator
savemodel(prefix)

Save the model.

Parameters:prefix (str) – prefix of the model path
Returns:None
train(text, nb_epoch=100, dropout_rate=0.01, optimizer='rmsprop')

Train the scRNN model.

Parameters:
  • text (str) – training corpus
  • nb_epoch (int) – number of epochs (Default: 100)
  • dropout_rate (float) – dropout rate (Default: 0.01)
  • optimizer (str) – optimizer (Default: “rmsprop”)
shorttext.spell.sakaguchi.loadSCRNNSpellCorrector(filepath, compact=True)

Load a pre-trained scRNN spell corrector instance.

Parameters:
  • filepath (str) – path of the model if compact==True; prefix of the model path if compact==False
  • compact (bool) – whether model file is compact (Default: True)
Returns:an instance of the scRNN spell corrector
Return type:SCRNNSpellCorrector

Reference

Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme, “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Networ,” arXiv:1608.02214 (2016). [arXiv]

Peter Norvig, “How to Write a Spelling Corrector.” (2016) [Norvig]