Since version 0.6.1, the package shorttext supports character-based models. The first important component of a character-based model is converting every character to a one-hot vector. The class shorttext.generators.SentenceToCharVecEncoder handles this. This class incorporates OneHotEncoder from scikit-learn and Dictionary from gensim.
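As a rough illustration of what one-hot encoding of characters means, the sketch below maps each distinct character to a column index and sets a single 1.0 per row. It is not the implementation used by shorttext (which relies on OneHotEncoder and Dictionary); the helper names build_char_index and one_hot_chars are hypothetical and exist only for this example.

```python
import numpy as np

def build_char_index(text):
    """Map each distinct character in `text` to a column index (sorted for determinism)."""
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}

def one_hot_chars(sentence, char_index):
    """Return a (len(sentence), n_chars) matrix with a single 1.0 per row.

    Characters missing from `char_index` raise KeyError in this sketch.
    """
    matrix = np.zeros((len(sentence), len(char_index)))
    for row, ch in enumerate(sentence):
        matrix[row, char_index[ch]] = 1.0
    return matrix

# Hypothetical toy alphabet taken from a short string.
char_index = build_char_index("blue crab")
vectors = one_hot_chars("cab", char_index)
```

Each row of vectors corresponds to one character of the input, and each column to one character of the alphabet.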

To use this class, import the packages first:

>>> import numpy as np
>>> import shorttext

Then we use a text file as the source of all characters to be encoded. In this case, we choose the file big.txt from Peter Norvig’s website:

>>> from urllib.request import urlopen
>>> textfile = urlopen('', 'r')

Then instantiate the class using the function shorttext.generators.initSentenceToCharVecEncoder():

>>> chartovec_encoder = shorttext.generators.initSentenceToCharVecEncoder(textfile)
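Conceptually, the source text only needs to supply the set of characters that can be encoded. The following sketch shows one way such a set could be gathered from a file-like object. The helper collect_characters is hypothetical, and the handling of bytes (decoding with UTF-8 when no encoding is given) is an assumption made for this example, not a description of how shorttext reads the file.

```python
import io

def collect_characters(textfile, encoding=None):
    """Gather the set of distinct characters in a file-like object.

    Lines that arrive as bytes (as from an HTTP response) are decoded
    with `encoding`, falling back to UTF-8 (an assumption of this sketch).
    """
    chars = set()
    for line in textfile:
        if isinstance(line, bytes):
            line = line.decode(encoding or "utf-8")
        chars.update(line)
    return chars

# An in-memory stand-in for a downloaded corpus.
sample = io.StringIO("Maryland\nblue crab\n")
alphabet = sorted(collect_characters(sample))
```

The newline character appears in alphabet because it is part of every line read from the source.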

Now, the object chartovec_encoder is an instance of shorttext.generators.SentenceToCharVecEncoder. The default signal character is \n, which is also encoded, and can be checked by looking at the field:

>>> chartovec_encoder.signalchar

We can convert a sentence into a matrix of one-hot vectors. For example,

>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100)
<1x93 sparse matrix of type '<type 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Column format>

This outputs a sparse matrix. Depending on your needs, you can add the signal character to the beginning or the end of the sentence in the output matrix:

>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=True, endsig=False)
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=False, endsig=True)
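The effect of startsig and endsig can be pictured at the string level, before any encoding takes place. The sketch below is only an illustration: frame_sentence is a hypothetical helper, the newline is used as the signal character, and truncating to maxlen before adding the signal characters is an assumption not stated in this document.

```python
def frame_sentence(sent, maxlen, startsig=False, endsig=False, signalchar="\n"):
    """Truncate `sent` to `maxlen` characters, then add optional signal characters."""
    body = sent[:maxlen]  # assumed order: truncate first, then add signals
    return (signalchar if startsig else "") + body + (signalchar if endsig else "")

with_start = frame_sentence("Maryland blue crab!", 100, startsig=True)
with_end = frame_sentence("Maryland blue crab!", 100, endsig=True)
truncated = frame_sentence("Maryland blue crab!", 8, endsig=True)
```

A signal character marks a sentence boundary explicitly, which can be useful when a model needs to learn where a sequence starts or stops.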

We can also convert a list of sentences:

>>> chartovec_encoder.encode_sentences(sentences, 100, startsig=False, endsig=True, sparse=False)

You can decide whether to output a sparse matrix by specifying the parameter sparse.
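For the dense case, the result can be thought of as a rank-3 array indexed by sentence, character position, and character code. The sketch below builds such an array with plain NumPy. The helper encode_batch_dense is hypothetical, and padding short sentences with all-zero rows is an assumption of this example rather than documented behavior of encode_sentences.

```python
import numpy as np

def encode_batch_dense(sentences, maxlen, char_index):
    """Stack one-hot encodings into a (n_sentences, maxlen, n_chars) array.

    Sentences longer than `maxlen` are truncated; shorter ones are
    padded with all-zero rows (an assumption of this sketch).
    """
    tensor = np.zeros((len(sentences), maxlen, len(char_index)))
    for i, sent in enumerate(sentences):
        for j, ch in enumerate(sent[:maxlen]):
            tensor[i, j, char_index[ch]] = 1.0
    return tensor

sentences = ["crab", "blue crab"]
# Hypothetical alphabet built from the sentences themselves.
char_index = {ch: idx for idx, ch in enumerate(sorted(set("".join(sentences))))}
tensor = encode_batch_dense(sentences, 6, char_index)
```

A dense array of this shape can be fed directly to a sequence model, at the cost of more memory than a sparse representation.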

class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary, signalchar='\n')

A class that facilitates one-hot encoding from characters to vectors.


Convert the sentence to a one-hot vector.

Parameters: sent (str) – sentence
Returns: a one-hot vector, with each element the code of that character
Return type: numpy.array

encode_sentence(sent, maxlen, startsig=False, endsig=False)

Encode one sentence to a sparse matrix, with each row the expanded vector of each character.

Parameters:
  • sent (str) – sentence
  • maxlen (int) – maximum length of the sentence
  • startsig (bool) – signal character at the beginning of the sentence (Default: False)
  • endsig (bool) – signal character at the end of the sentence (Default: False)
Returns: matrix representing the sentence
Return type: scipy.sparse.csc_matrix

encode_sentences(sentences, maxlen, sparse=True, startsig=False, endsig=False)

Encode many sentences into a rank-3 tensor.

Parameters:
  • sentences (list) – sentences
  • maxlen (int) – maximum length of one sentence
  • sparse (bool) – whether to return a sparse matrix (Default: True)
  • startsig (bool) – signal character at the beginning of the sentence (Default: False)
  • endsig (bool) – signal character at the end of the sentence (Default: False)
Returns: rank-3 tensor of the sentences
Return type: scipy.sparse.csc_matrix or numpy.array

shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile, encoding=None)

Instantiate a SentenceToCharVecEncoder from a text file.

Parameters:
  • textfile (file) – text file
  • encoding (str) – encoding of the text file (Default: None)
Returns: an instance of SentenceToCharVecEncoder
Return type: SentenceToCharVecEncoder



