Character to One-Hot Vector¶
Since version 0.6.1, the package shorttext deals with character-based model. A first important
component of character-based model is to convert every character to a one-hot vector. We provide a class
shorttext.generators.SentenceToCharVecEncoder
to deal with this. Thi class incorporates
the OneHotEncoder in scikit-learn and Dictionary in gensim.
To use this, import the packages first:
>>> import numpy as np
>>> import shorttext
Then we incorporate a text file as the source of all characters to be coded. In this case, we choose the file big.txt in Peter Norvig’s websites:
>>> from urllib.request import urlopen
>>> textfile = urlopen('http://norvig.com/big.txt', 'r')
Then instantiate the class using the function shorttext.generators.initSentenceToCharVecEncoder()
:
>>> chartovec_encoder = shorttext.generators.initSentenceToCharVecEncoder(textfile)
Now, the object chartovec_encoder is an instance of shorttext.generators.SentenceToCharVecEncoder
. The
default signal character is n, which is also encoded, and can be checked by looking at the field:
>>> chartovec_encoder.signalchar
We can convert a sentence into a bunch of one-hot vectors in terms of a matrix. For example,
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100)
<1x93 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>
This outputs a sparse matrix. Depending on your needs, you can add signal character to the beginning or the end of the sentence in the output matrix by:
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=True, endsig=False)
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=False, endsig=True)
We can also convert a list of sentences by
>>> chartovec_encoder.encode_sentences(sentences, 100, startsig=False, endsig=True, sparse=False)
You can decide whether or not to output a sparse matrix by specifiying the parameter sparse.
Reference¶
Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O'Reilly]
Home: Homepage of shorttext