Character to One-Hot Vector

Since version 0.6.1, the shorttext package supports character-based models. The first important component of a character-based model is the conversion of every character to a one-hot vector. The class shorttext.generators.SentenceToCharVecEncoder handles this. It incorporates the OneHotEncoder in scikit-learn and the Dictionary in gensim.
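
To make the idea concrete, here is a minimal sketch of character-level one-hot encoding in plain Python. It is not how shorttext implements it (the package relies on gensim and scikit-learn); the helper names char_index and one_hot_rows are hypothetical and exist only for illustration.

```python
def char_index(text):
    """Map each distinct character in `text` to a column index (sorted order)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot_rows(sentence, index):
    """Return one row per character; each row holds a single 1."""
    rows = []
    for ch in sentence:
        row = [0] * len(index)
        row[index[ch]] = 1  # switch on the column assigned to this character
        rows.append(row)
    return rows


index = char_index("crab")
rows = one_hot_rows("crab", index)
for ch, row in zip("crab", rows):
    print(repr(ch), row)
```

Each row has exactly one non-zero entry, which is why a sparse matrix is a natural container for this kind of encoding.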

To use this, import the packages first:

>>> import numpy as np
>>> import shorttext


Then we supply a text file as the source of all characters to be encoded. In this case, we choose the file big.txt on Peter Norvig’s website:

>>> from urllib.request import urlopen
>>> textfile = urlopen('http://norvig.com/big.txt')


Then instantiate the class using the function shorttext.generators.initSentenceToCharVecEncoder():

>>> chartovec_encoder = shorttext.generators.initSentenceToCharVecEncoder(textfile)
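
Conceptually, the encoder needs to know every distinct character that occurs in the source file. The sketch below collects such a character inventory from a file-like object, with io.StringIO standing in for a real file such as big.txt. The function build_char_index is hypothetical and is not part of shorttext; it only mimics the idea of building a character dictionary.

```python
import io


def build_char_index(textfile):
    """Collect every distinct character seen in a text-mode file-like object."""
    chars = set()
    for line in textfile:
        chars.update(line)  # a str is an iterable of its characters
    return {ch: i for i, ch in enumerate(sorted(chars))}


# io.StringIO stands in for a real text file (assumption for this sketch).
sample = io.StringIO("blue crab\nbay\n")
char_to_col = build_char_index(sample)
print(len(char_to_col), "distinct characters")
```

Note that the newline character is part of the inventory, because it appears at the end of each line read from the file. A file-like object that yields bytes instead of str would need to be decoded first.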


Now, the object chartovec_encoder is an instance of shorttext.generators.SentenceToCharVecEncoder. The default signal character is \n (the newline character), which is also encoded. It can be checked by looking at the attribute:

>>> chartovec_encoder.signalchar


We can convert a sentence into a matrix in which each row is the one-hot vector of one character. For example,

>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100)
<1x93 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>


This outputs a sparse matrix. Depending on your needs, you can add the signal character to the beginning or the end of the sentence in the output matrix:

>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=True, endsig=False)
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=False, endsig=True)
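
The effect of startsig and endsig can be pictured at the string level, before any encoding takes place. The following sketch assumes that the sentence is first cut to maxlen characters and the signal character is then attached. Both the function add_signals and this ordering are assumptions made for illustration; the exact behaviour of the library may differ.

```python
def add_signals(sent, maxlen, startsig=False, endsig=False, signalchar='\n'):
    """Truncate to maxlen, then optionally wrap with the signal character.

    Hypothetical helper: the truncate-then-wrap order is an assumption.
    """
    body = sent[:maxlen]
    prefix = signalchar if startsig else ''
    suffix = signalchar if endsig else ''
    return prefix + body + suffix


print(repr(add_signals('Maryland blue crab!', 100, startsig=True)))
print(repr(add_signals('Maryland blue crab!', 8, endsig=True)))
```

A start or end signal gives a sequence model an explicit marker for where a sentence begins or ends.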


We can also convert a list of sentences:

>>> chartovec_encoder.encode_sentences(sentences, 100, startsig=False, endsig=True, sparse=False)


You can decide whether or not to output a sparse matrix by specifying the parameter sparse.
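
To illustrate what a rank-3 output and the dense versus sparse choice mean, the sketch below encodes a small batch into nested lists of shape (number of sentences, maxlen, number of characters) and then lists only the positions that hold a 1. The helpers encode_batch and to_coordinates are hypothetical, and padding short sentences with all-zero rows is an assumption of this sketch rather than documented library behaviour.

```python
def encode_batch(sentences, maxlen, col_of):
    """Dense nested list with shape (len(sentences), maxlen, len(col_of))."""
    batch = []
    for sent in sentences:
        # Start from all-zero rows; unused positions stay zero (assumed padding).
        rows = [[0] * len(col_of) for _ in range(maxlen)]
        for pos, ch in enumerate(sent[:maxlen]):
            rows[pos][col_of[ch]] = 1
        batch.append(rows)
    return batch


def to_coordinates(batch):
    """Sparse view: only the (sentence, position, column) triples holding a 1."""
    return [(s, p, c)
            for s, rows in enumerate(batch)
            for p, row in enumerate(rows)
            for c, value in enumerate(row) if value]


col_of = {'a': 0, 'b': 1, 'c': 2}
dense = encode_batch(['ab', 'cab'], 4, col_of)
coords = to_coordinates(dense)
print(len(dense), len(dense[0]), len(dense[0][0]))
print(coords)
```

The dense form stores 2 × 4 × 3 = 24 numbers, while the coordinate form stores only the five non-zero positions. The gap grows quickly with larger character sets and longer sentences.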

class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary, signalchar='\n')

A class that facilitates one-hot encoding from characters to vectors.

calculate_prelim_vec(sent)

Convert the sentence to a one-hot vector.

Parameters:
sent (str) – sentence
Returns: a one-hot vector, with each element the code of that character
Return type: numpy.array

encode_sentence(sent, maxlen, startsig=False, endsig=False)

Encode one sentence to a sparse matrix, with each row the expanded vector of each character.

Parameters:
sent (str) – sentence
maxlen (int) – maximum length of the sentence
startsig (bool) – signal character at the beginning of the sentence (Default: False)
endsig (bool) – signal character at the end of the sentence (Default: False)
Returns: matrix representing the sentence
Return type: scipy.sparse.csc_matrix

encode_sentences(sentences, maxlen, sparse=True, startsig=False, endsig=False)

Encode many sentences into a rank-3 tensor.

Parameters:
sentences (list) – sentences
maxlen (int) – maximum length of one sentence
sparse (bool) – whether to return a sparse matrix (Default: True)
startsig (bool) – signal character at the beginning of the sentence (Default: False)
endsig (bool) – signal character at the end of the sentence (Default: False)
Returns: rank-3 tensor of the sentences
Return type: scipy.sparse.csc_matrix or numpy.array

shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile, encoding=None)

Instantiate a SentenceToCharVecEncoder from a text file.

Parameters:
textfile (file) – text file
encoding (str) – encoding of the text file (Default: None)
Returns: an instance of SentenceToCharVecEncoder
Return type: SentenceToCharVecEncoder

Reference

Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O’Reilly]