Character to One-Hot Vector
Since version 0.6.1, the package shorttext supports character-based models. The first important
component of a character-based model is the conversion of every character to a one-hot vector. We provide the class
shorttext.generators.SentenceToCharVecEncoder for this purpose. This class incorporates
the OneHotEncoder in scikit-learn and the Dictionary in gensim.
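To make the idea of a one-hot vector concrete, here is a minimal pure-Python sketch. It does not use shorttext, scikit-learn or gensim; the tiny alphabet and the helper name `one_hot` are invented for illustration only.

```python
# Illustrative only: a tiny alphabet standing in for the characters of a corpus.
alphabet = sorted(set("blue crab"))
char_to_index = {ch: i for i, ch in enumerate(alphabet)}


def one_hot(ch):
    """Return a list with a single 1.0 at the index assigned to `ch`."""
    vec = [0.0] * len(alphabet)
    vec[char_to_index[ch]] = 1.0
    return vec


# One vector per character; every vector has exactly one non-zero entry.
vectors = [one_hot(ch) for ch in "crab"]
```

Each character maps to a vector as long as the alphabet, so a large corpus yields long and mostly empty vectors, which is why a sparse representation is attractive.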
To use the encoder, import the packages first:
>>> import numpy as np
>>> import shorttext
Next, we need a text file that supplies all the characters to be encoded. In this case, we choose the file big.txt from Peter Norvig's website:
>>> from urllib.request import urlopen
>>> textfile = urlopen('http://norvig.com/big.txt')
Then instantiate the class using the function shorttext.generators.initSentenceToCharVecEncoder():
>>> chartovec_encoder = shorttext.generators.initSentenceToCharVecEncoder(textfile)
Now, the object chartovec_encoder is an instance of shorttext.generators.SentenceToCharVecEncoder. The
default signal character is \n, which is also encoded, and can be checked by looking at the field:
>>> chartovec_encoder.signalchar
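The sketch below illustrates, under stated assumptions, how a character dictionary could be built from a file-like object so that the signal character always receives a code. It is not the library's implementation: the function name `build_char_index` is hypothetical, the newline default is this example's own choice, and an in-memory `io.StringIO` stands in for a real corpus file.

```python
import io


def build_char_index(textfile, signalchar='\n'):
    """Map every distinct character in a file-like object to an integer id."""
    char_to_index = {}
    for line in textfile:
        for ch in line:
            # Assign the next free id the first time a character is seen.
            char_to_index.setdefault(ch, len(char_to_index))
    # Make sure the signal character has an id even if the text never contains it.
    char_to_index.setdefault(signalchar, len(char_to_index))
    return char_to_index


sample = io.StringIO("blue crab\nold bay\n")  # stand-in for a real corpus file
index = build_char_index(sample)
```

Because the lines of a text file end with a newline, a newline signal character is typically picked up from the corpus itself; the final `setdefault` only matters for other choices of signal character.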
We can convert a sentence into a matrix whose rows are the one-hot vectors of its characters. For example:
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100)
<1x93 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>
This outputs a sparse matrix. Depending on your needs, you can add the signal character to the beginning or the end of the sentence in the output matrix:
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=True, endsig=False)
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=False, endsig=True)
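As a rough illustration of what the start and end signals do, the following pure-Python sketch encodes a sentence into a fixed number of one-hot rows. It rests on assumptions that this page does not confirm: that the sentence is truncated to maxlen characters, that each requested signal adds one extra row, and that unused rows stay all-zero. The function `encode` and the small vocabulary are hypothetical.

```python
def encode(sent, maxlen, char_to_index, signalchar='\n', startsig=False, endsig=False):
    """One-hot rows for `sent`, truncated to `maxlen` and zero-padded."""
    chars = sent[:maxlen]
    if startsig:
        chars = signalchar + chars
    if endsig:
        chars = chars + signalchar
    # Assumed layout: one row per character slot, plus one per requested signal.
    nrows = maxlen + int(startsig) + int(endsig)
    rows = [[0.0] * len(char_to_index) for _ in range(nrows)]
    for i, ch in enumerate(chars):
        rows[i][char_to_index[ch]] = 1.0
    return rows


# Hypothetical vocabulary covering the example sentence and the signal character.
vocab = {ch: i for i, ch in enumerate(sorted(set("crab!\n")))}
with_end = encode("crab!", 8, vocab, endsig=True)
with_start = encode("crab!", 8, vocab, startsig=True)
```

In `with_end` the signal occupies the row right after the last character, while in `with_start` it occupies the first row; the remaining rows are padding.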
We can also convert a list of sentences:
>>> sentences = ['Maryland blue crab!', 'Old Bay seasoning']   # example input
>>> chartovec_encoder.encode_sentences(sentences, 100, startsig=False, endsig=True, sparse=False)
You can decide whether or not to output a sparse matrix by specifying the parameter sparse.
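The difference between dense and sparse output can be sketched in pure Python. The example below builds a dense rank-3 structure of shape (number of sentences, maxlen, vocabulary size) as nested lists, then lists only the positions that hold a 1. The names `encode_dense` and `to_coordinates` are hypothetical, and the coordinate list is only a conceptual stand-in for a sparse format, not what shorttext returns.

```python
def encode_dense(sentences, maxlen, char_to_index):
    """Nested list of shape (len(sentences), maxlen, len(char_to_index))."""
    width = len(char_to_index)
    tensor = [[[0.0] * width for _ in range(maxlen)] for _ in sentences]
    for s, sent in enumerate(sentences):
        for pos, ch in enumerate(sent[:maxlen]):
            tensor[s][pos][char_to_index[ch]] = 1.0
    return tensor


def to_coordinates(tensor):
    """Sparse view: only the (sentence, position, character) triples holding a 1."""
    return [(s, p, c)
            for s, matrix in enumerate(tensor)
            for p, row in enumerate(matrix)
            for c, value in enumerate(row)
            if value]


sentences = ["crab", "bay"]
char_to_index = {ch: i for i, ch in enumerate(sorted(set("".join(sentences))))}
dense = encode_dense(sentences, 5, char_to_index)
coords = to_coordinates(dense)
```

Here the dense structure holds 50 numbers while only 7 of them are non-zero, which is the saving a sparse representation aims for.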
class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary, signalchar='\n')
A class that facilitates one-hot encoding from characters to vectors.

calculate_prelim_vec(sent)
Convert the sentence to a one-hot vector.
Parameters:
- sent (str) – sentence
Returns: a one-hot vector, with each element the code of that character
Return type: numpy.array

encode_sentence(sent, maxlen, startsig=False, endsig=False)
Encode one sentence to a sparse matrix, with each row the expanded vector of each character.
Parameters:
- sent (str) – sentence
- maxlen (int) – maximum length of the sentence
- startsig (bool) – signal character at the beginning of the sentence (Default: False)
- endsig (bool) – signal character at the end of the sentence (Default: False)
Returns: matrix representing the sentence
Return type: scipy.sparse.csc_matrix

encode_sentences(sentences, maxlen, sparse=True, startsig=False, endsig=False)
Encode many sentences into a rank-3 tensor.
Parameters:
- sentences (list) – sentences
- maxlen (int) – maximum length of one sentence
- sparse (bool) – whether to return a sparse matrix (Default: True)
- startsig (bool) – signal character at the beginning of the sentence (Default: False)
- endsig (bool) – signal character at the end of the sentence (Default: False)
Returns: rank-3 tensor of the sentences
Return type: scipy.sparse.csc_matrix or numpy.array

shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile, encoding=None)
Instantiate a class of SentenceToCharVecEncoder from a text file.
Parameters:
- textfile (file) – text file
- encoding (str) – encoding of the text file (Default: None)
Returns: an instance of SentenceToCharVecEncoder
Return type: SentenceToCharVecEncoder