Spell Correctors
This package supports spell correctors, because typos are very common in relatively short text data.
Two types of spell correctors are provided: the one described by Peter Norvig (a Bayesian method combined with edit distance), and another by Keisuke Sakaguchi and his colleagues (a semi-character-level recurrent neural network).
>>> import shorttext
We use Norvig’s training corpus as an example. To load it:
>>> from urllib.request import urlopen
>>> text = urlopen('https://norvig.com/big.txt').read().decode('utf-8')  # decode the downloaded bytes to str
To use a spell corrector, instantiate it, train it on a corpus to obtain a correction model, and then use the model to correct words.
Norvig
Peter Norvig described a spell corrector based on a Bayesian approach and edit distance. Refer to his blog for more information.
>>> norvig_corrector = shorttext.spell.NorvigSpellCorrector()
>>> norvig_corrector.train(text)
>>> norvig_corrector.correct('oranhe') # gives "orange"
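The idea behind this corrector can be sketched in a few lines of plain Python. The following is a minimal illustration in the spirit of Norvig's blog post, not the code of this package: the class name `TinyNorvigCorrector`, the helpers `words`, `edits1` and `edits2`, and the toy corpus are all hypothetical. It ranks known words within one or two edits of the input by their corpus frequency.

```python
import re
from collections import Counter

LETTERS = 'abcdefghijklmnopqrstuvwxyz'


def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r'[a-z]+', text.lower())


class TinyNorvigCorrector:
    """Illustrative stand-in only; not the shorttext implementation."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def train(self, text):
        # The "model" is just a table of word frequencies.
        self.counts = Counter(words(text))
        self.total = sum(self.counts.values())

    def P(self, word):
        # Relative frequency of the word in the training corpus.
        return self.counts[word] / self.total if self.total else 0.0

    def known(self, candidates):
        # Keep only the words seen in the training corpus.
        return {w for w in candidates if w in self.counts}

    def edits1(self, word):
        # All strings one delete, transpose, replace or insert away.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
        inserts = [a + c + b for a, b in splits for c in LETTERS]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        # All strings two edits away.
        return {e2 for e1 in self.edits1(word) for e2 in self.edits1(e1)}

    def candidates(self, word):
        # Prefer the word itself, then one edit, then two edits.
        return (self.known([word]) or self.known(self.edits1(word))
                or self.known(self.edits2(word)) or {word})

    def correct(self, word):
        # Pick the most frequent candidate.
        return max(self.candidates(word), key=self.P)


corrector = TinyNorvigCorrector()
corrector.train('an orange is an orange and a range is a range but an orange is sweet')
print(corrector.correct('oranhe'))  # orange
```

With a realistic corpus such as the one loaded above, the frequency table is large enough for the ranking by `P` to become meaningful; on this toy corpus it only shows the mechanics.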
- class shorttext.spell.norvig.NorvigSpellCorrector
  Spell corrector described by Peter Norvig in his blog (https://norvig.com/spell-correct.html).
  - P(word)
    Compute the probability of drawing the given word when sampling a word at random from the training corpus.
    Parameters: word (str) – a word
    Returns: probability of the word sampled randomly in the corpus
    Return type: float
  - candidates(word)
    List potential candidates for the corrected spelling of the given word.
    Parameters: word (str) – a word
    Returns: list of recommended corrections
    Return type: list
  - correct(word)
    Recommend a spelling correction for the given word.
    Parameters: word (str) – a word
    Returns: recommended correction
    Return type: str
  - known(words)
    Filter away the words that are not found in the training corpus.
    Parameters: words (list) – list of words
    Returns: list of words that can be found in the training corpus
    Return type: list
  - train(text)
    Given the text, train the spell corrector.
    Parameters: text (str) – training corpus
Sakaguchi (SCRNN - semi-character recurrent neural network)
Keisuke Sakaguchi and his colleagues developed this spell corrector with the insight that most typos occur in the interior of a word, between its first and last characters. They developed a recurrent neural network that learns the possible changes within a spelling. There are seven modes:
- JUMBLE-WHOLE
- JUMBLE-BEG
- JUMBLE-END
- JUMBLE-INT
- NOISE-INSERT
- NOISE-DELETE
- NOISE-REPLACE
The original intent of their work was not to invent a new spell corrector but to study the “Cmabrigde Uinervtisy” effect; nevertheless, the model can be used as a spell corrector.
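The representation behind this approach can be illustrated without any neural network. In the paper, each word is encoded as the concatenation of a one-hot vector for its first character, a bag of counts for its internal characters, and a one-hot vector for its last character. The sketch below assumes a lowercase 26-letter alphabet (the package's default alphabet is larger), and the function names `semi_character_vector` and `jumble_internal` are hypothetical, not part of this package. The jumbling function is assumed here to correspond to the JUMBLE-INT mode, and the treatment of one-letter words is a choice made for this sketch.

```python
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def semi_character_vector(word):
    """Return [first-char one-hot] + [internal-char counts] + [last-char one-hot].

    Expects a non-empty word made of lowercase a-z letters.
    A one-letter word sets both the first and the last slot.
    """
    n = len(ALPHABET)
    first, internal, last = [0] * n, [0] * n, [0] * n
    first[INDEX[word[0]]] = 1
    last[INDEX[word[-1]]] = 1
    for ch in word[1:-1]:
        internal[INDEX[ch]] += 1
    return first + internal + last


def jumble_internal(word, rng):
    """Shuffle the internal characters, keeping the first and last in place."""
    if len(word) <= 3:
        return word  # at most one internal character, nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + ''.join(middle) + word[-1]


rng = random.Random(0)
original = 'cambridge'
jumbled = jumble_internal(original, rng)
print(jumbled)
# The encoding ignores the order of the internal characters:
print(semi_character_vector(jumbled) == semi_character_vector(original))  # True
```

Because only the interior is shuffled, the jumbled and original words map to the same vector. Other kinds of noise, such as inserting or replacing a character, do change the vector, and that is where the trained network has to do the work.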
>>> scrnn_corrector = shorttext.spell.SCRNNSpellCorrector('JUMBLE-WHOLE')
>>> scrnn_corrector.train(text)
>>> scrnn_corrector.correct('oranhe') # gives "orange"
We can persist the SCRNN corrector for future use:
>>> scrnn_corrector.save_compact_model('/path/to/spellscrnn.bin')
To load it:
>>> corrector = shorttext.spell.loadSCRNNSpellCorrector('/path/to/spellscrnn.bin')
- class shorttext.spell.sakaguchi.SCRNNSpellCorrector(operation, alph="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz., :;'*!?`$%&(){}[]-/\@_#", specialsignals={'eos': '#', 'number': '@', 'unk': '_'}, concatcharvec_encoder=None, batchsize=1, nb_hiddenunits=650)
  scRNN (semi-character-level recurrent neural network) spell corrector.
  Reference: Keisuke Sakaguchi, Kevin Duh, Matt Post, Benjamin Van Durme, “Robsut Wrod Reocginiton via semi-Character Recurrent Neural Network,” arXiv:1608.02214 (2016).
  - correct(word)
    Recommend a spelling correction for the given word.
    Parameters: word (str) – a given word
    Returns: recommended correction
    Return type: str
    Raises: ModelNotTrainedException
  - loadmodel(prefix)
    Load the model.
    Parameters: prefix (str) – prefix of the model path
    Returns: None
  - preprocess_text_correct(text)
    A generator that outputs numpy vectors for the text to be corrected.
    Parameters: text (str) – text
    Returns: generator that outputs the numpy vectors for correction
    Return type: generator
    Raises: ModelNotTrainedException, if the model has not been trained
  - preprocess_text_train(text)
    A generator that outputs numpy vectors for the text used for training.
    Parameters: text (str) – text
    Returns: generator that outputs the numpy vectors for training
    Return type: generator
  - savemodel(prefix)
    Save the model.
    Parameters: prefix (str) – prefix of the model path
    Returns: None
  - train(text, nb_epoch=100, dropout_rate=0.01, optimizer='rmsprop')
    Train the scRNN model.
    Parameters:
    - text (str) – training corpus
    - nb_epoch (int) – number of epochs (Default: 100)
    - dropout_rate (float) – dropout rate (Default: 0.01)
    - optimizer (str) – optimizer (Default: “rmsprop”)
- shorttext.spell.sakaguchi.loadSCRNNSpellCorrector(filepath, compact=True)
  Load a pre-trained scRNN spell corrector instance.
  Parameters:
  - filepath (str) – path of the model if compact==True; prefix of the model path if compact==False
  - compact (bool) – whether the model file is compact (Default: True)
  Returns: an instance of the scRNN spell corrector
  Return type: SCRNNSpellCorrector