Character to One-Hot Vector

Since version 0.6.1, the package shorttext supports character-based models. The first important component of a character-based model is converting every character to a one-hot vector. The class shorttext.generators.SentenceToCharVecEncoder handles this. This class incorporates the OneHotEncoder in scikit-learn and the Dictionary in gensim.
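To make the idea concrete, here is a minimal pure-Python sketch of character-level one-hot encoding. It is not the implementation used by shorttext (which relies on scikit-learn and gensim); the helper names build_char_index and one_hot_rows are hypothetical and exist only for illustration.

```python
def build_char_index(text):
    """Map each distinct character to a column index (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot_rows(sentence, char_index):
    """Return one row per character, with a single 1.0 in that character's column."""
    rows = []
    for ch in sentence:
        row = [0.0] * len(char_index)
        if ch in char_index:  # assumption: unknown characters stay all-zero
            row[char_index[ch]] = 1.0
        rows.append(row)
    return rows


# Hypothetical five-character alphabet: newline, space, 'a', 'b', 'c'.
char_index = build_char_index("abc \n")
matrix = one_hot_rows("cab", char_index)
```

Each row of matrix corresponds to one character of the input, and each column to one character of the alphabet.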

To use this, import the packages first:

>>> import numpy as np
>>> import shorttext

Then we choose a text file as the source of all characters to be encoded. In this case, we use the file big.txt from Peter Norvig’s website:

>>> from urllib.request import urlopen
>>> textfile = urlopen('http://norvig.com/big.txt')

Then instantiate the class using the classmethod SentenceToCharVecEncoder.from_pretrained:

>>> chartovec_encoder = shorttext.generators.SentenceToCharVecEncoder.from_pretrained(textfile)

Now, the object chartovec_encoder is an instance of shorttext.generators.SentenceToCharVecEncoder. The default signal character is \n (the newline character), which is also encoded. It can be checked by looking at the field:

>>> chartovec_encoder.signalchar

We can convert a sentence into a matrix whose rows are one-hot vectors. For example,

>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100)
<1x93 sparse matrix of type '<type 'numpy.float64'>'
        with 1 stored elements in Compressed Sparse Column format>

This outputs a sparse matrix. Depending on your needs, you can add the signal character to the beginning or the end of the sentence in the output matrix:

>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=True, endsig=False)
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=False, endsig=True)
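The effect of maxlen, startsig, and endsig can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it assumes that sentences longer than maxlen are truncated, that shorter ones are padded with all-zero rows, and that the result has maxlen + startsig + endsig rows as described in the API reference. The function encode_with_signals is hypothetical.

```python
def encode_with_signals(sentence, char_index, maxlen, signalchar="\n",
                        startsig=False, endsig=False):
    """Hypothetical helper: one-hot encode a sentence into a fixed number of rows."""
    chars = list(sentence[:maxlen])  # assumption: truncate to maxlen characters
    if startsig:
        chars.insert(0, signalchar)  # prepend the signal character
    if endsig:
        chars.append(signalchar)     # append the signal character
    nrows = maxlen + int(startsig) + int(endsig)

    rows = []
    for ch in chars:
        row = [0.0] * len(char_index)
        if ch in char_index:
            row[char_index[ch]] = 1.0
        rows.append(row)
    # Assumption: pad with all-zero rows up to the fixed row count.
    while len(rows) < nrows:
        rows.append([0.0] * len(char_index))
    return rows


# Toy alphabet: the signal character plus 'a' and 'b'.
encoded = encode_with_signals("ab", {"\n": 0, "a": 1, "b": 2},
                              maxlen=4, startsig=True)
```

Here encoded has five rows: one for the signal character, two for the letters, and two rows of padding.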

We can also convert a list of sentences by

>>> chartovec_encoder.encode_sentences(sentences, 100, startsig=False, endsig=True, sparse=False)

You can decide whether or not to output a sparse matrix by specifying the parameter sparse.
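The difference between the two output forms can be illustrated with a small pure-Python sketch. It does not use shorttext or scipy. The dense form is modeled as nested lists shaped (n_sentences, maxlen, num_chars), and the sparse form as a list of (row, column) coordinates per sentence. The helper names and the sample sentences are hypothetical.

```python
def dense_batch(sentences, char_index, maxlen):
    """Nested lists shaped (n_sentences, maxlen, num_chars), mostly zeros."""
    batch = []
    for sent in sentences:
        rows = [[0.0] * len(char_index) for _ in range(maxlen)]
        for pos, ch in enumerate(sent[:maxlen]):
            if ch in char_index:
                rows[pos][char_index[ch]] = 1.0
        batch.append(rows)
    return batch


def sparse_batch(sentences, char_index, maxlen):
    """One list of (row, column) coordinates per sentence; only nonzeros are stored."""
    return [
        [(pos, char_index[ch]) for pos, ch in enumerate(sent[:maxlen]) if ch in char_index]
        for sent in sentences
    ]


sentences = ["crab", "blue crab"]
char_index = {ch: i for i, ch in enumerate(sorted(set("".join(sentences))))}
dense = dense_batch(sentences, char_index, maxlen=6)
sparse = sparse_batch(sentences, char_index, maxlen=6)
```

The dense form stores every cell, while the sparse form stores one coordinate pair per encoded character, which is why sparse output is usually preferable for large alphabets and long maxlen values.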

class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')

Bases: object

One-hot encoder for character-level text representations.

Converts sentences into one-hot encoded vectors at the character level. Useful for character-level sequence models.

Reference:

General architecture inspired by char-RNN and related models.

__init__(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')

Initialize the character vector encoder.

Args:

dictionary: Gensim Dictionary mapping characters to indices.
signalchar: Signal character for sequence markers. Default: '\n'.

calculate_prelim_vec(sent: str) → ndarray[tuple[Any, ...], dtype[float64]]

Convert sentence to one-hot character vectors.

Args:

sent: Input sentence.

Returns:

One-hot encoded sparse matrix where each row represents a character’s encoding.

encode_sentence(sent: str, maxlen: int, startsig: bool = False, endsig=False) → csc_matrix

Encode a sentence to a sparse character vector matrix.

Args:

sent: Input sentence to encode.
maxlen: Maximum length of the encoded sequence.
startsig: Whether to prepend signal character. Default: False.
endsig: Whether to append signal character. Default: False.

Returns:

Sparse matrix representing the sentence with shape (maxlen + startsig + endsig, num_chars).

encode_sentences(sentences: list[str], maxlen: int, sparse: bool = True, startsig: bool = False, endsig: bool = False) → list[ndarray[tuple[Any, ...], dtype[float64]]] | ndarray[tuple[Any, ...], dtype[float64]]

Encode multiple sentences into character vectors.

Args:

sentences: List of sentences to encode.
maxlen: Maximum length for each encoded sentence.
sparse: Whether to return sparse matrices. Default: True.
startsig: Whether to prepend signal character. Default: False.
endsig: Whether to append signal character. Default: False.

Returns:

If sparse=True: list of sparse matrices.
If sparse=False: numpy array of shape (n_sentences, maxlen, num_chars).

__len__() → int

Return the number of unique characters in the dictionary.

classmethod from_pretrained(textfile: str | PathLike, encoding: bool | None = None) → Self

Create a SentenceToCharVecEncoder from a text file.

Builds a character dictionary from the given text file and returns an encoder instance.

Args:

textfile: Path to the text file for building the character dictionary.
encoding: Encoding of the text file. Default: None.

Returns:

A SentenceToCharVecEncoder instance.
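As a rough illustration of what building a character dictionary from a text file involves, the following sketch collects the distinct characters of a file into a plain dict. It is an assumption-laden stand-in: the library builds a gensim Dictionary, whereas this example uses only the standard library, and char_vocabulary_from_file is a hypothetical name.

```python
import os
import tempfile


def char_vocabulary_from_file(path, encoding=None):
    """Hypothetical helper: map every distinct character in a file to an index."""
    chars = set()
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            chars.update(line)  # adds each character of the line, newline included
    return {ch: i for i, ch in enumerate(sorted(chars))}


# Write a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Maryland blue crab!\n")
    path = tmp.name

vocab = char_vocabulary_from_file(path, encoding="utf-8")
os.remove(path)
```

The size of vocab plays the role of the value returned by __len__, the number of unique characters known to the encoder.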

shorttext.generators.charbase.char2vec.initialize_SentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) → SentenceToCharVecEncoder

Deprecated. Use SentenceToCharVecEncoder.from_pretrained instead.

shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) → SentenceToCharVecEncoder

Deprecated. Use initialize_SentenceToCharVecEncoder instead.

Deprecated since version 4.0.0: This will be removed in 4.1.0.

Reference

Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O'Reilly]
