Character to One-Hot Vector
Since version 0.6.1, the shorttext package has supported character-based models. The first important
component of a character-based model is converting every character to a one-hot vector. We provide the class
shorttext.generators.SentenceToCharVecEncoder for this purpose. This class incorporates
the OneHotEncoder in scikit-learn and the Dictionary in gensim.
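To make the idea concrete, here is a minimal sketch of character-level one-hot encoding in plain Python. It does not use shorttext, scikit-learn, or gensim. The helper names build_char_index and one_hot_rows are hypothetical, and leaving unknown characters as all-zero rows is an assumption made for this illustration, not a statement about the library's behavior.

```python
def build_char_index(text):
    """Map each distinct character to an integer index (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot_rows(sentence, char_index):
    """Return one row per character, with a single 1.0 at that character's index."""
    rows = []
    for ch in sentence:
        row = [0.0] * len(char_index)
        # Assumption for this sketch: characters missing from the index
        # are left as all-zero rows.
        if ch in char_index:
            row[char_index[ch]] = 1.0
        rows.append(row)
    return rows


# Characters sort as '\n', 'a', 'b', 'c', so they get indices 0..3.
char_index = build_char_index("abc\n")
matrix = one_hot_rows("cab", char_index)
for row in matrix:
    print(row)
```

Each row has exactly one non-zero entry, which is why a sparse matrix is a natural storage format for this representation.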
To use this, import the packages first:
>>> import numpy as np
>>> import shorttext
Next, we supply a text file as the source of all characters to be encoded. In this case, we use the file big.txt from Peter Norvig’s website:
>>> from urllib.request import urlopen
>>> textfile = urlopen('http://norvig.com/big.txt')
Then instantiate the class using the classmethod SentenceToCharVecEncoder.from_pretrained:
>>> chartovec_encoder = shorttext.generators.SentenceToCharVecEncoder.from_pretrained(textfile)
Now, the object chartovec_encoder is an instance of shorttext.generators.SentenceToCharVecEncoder. The
default signal character is \n, which is also encoded, and can be checked by looking at the field:
>>> chartovec_encoder.signalchar
We can convert a sentence into one-hot vectors, returned together as a matrix. For example,
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100)
<1x93 sparse matrix of type '<type 'numpy.float64'>'
with 1 stored elements in Compressed Sparse Column format>
This outputs a sparse matrix. Depending on your needs, you can add the signal character to the beginning or the end of the sentence in the output matrix:
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=True, endsig=False)
>>> chartovec_encoder.encode_sentence('Maryland blue crab!', 100, startsig=False, endsig=True)
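The following plain-Python sketch illustrates one way a fixed-length encoding with optional signal characters could work. It is not the library's implementation: the function name encode_fixed_length is hypothetical, and truncating to maxlen, zero-padding the unused rows, and placing the end signal directly after the last character are all assumptions. Only the row count, maxlen + startsig + endsig, is taken from the documented return shape.

```python
def encode_fixed_length(sentence, char_index, maxlen,
                        signalchar="\n", startsig=False, endsig=False):
    """Encode a sentence as a fixed-size list of one-hot rows (illustrative only)."""
    # Assumption: sentences longer than maxlen are truncated.
    chars = list(sentence[:maxlen])
    if startsig:
        chars = [signalchar] + chars
    if endsig:
        chars = chars + [signalchar]

    # Row count follows the documented shape: maxlen + startsig + endsig.
    nrows = maxlen + int(startsig) + int(endsig)
    matrix = [[0.0] * len(char_index) for _ in range(nrows)]
    for row, ch in enumerate(chars):
        col = char_index.get(ch)
        if col is not None:  # unknown characters stay all-zero (assumption)
            matrix[row][col] = 1.0
    return matrix


char_index = {"\n": 0, "a": 1, "b": 2}
encoded = encode_fixed_length("ab", char_index, maxlen=4, endsig=True)
print(len(encoded), len(encoded[0]))
```

In this sketch the rows after the end signal remain all zeros, so every encoded sentence has the same number of rows regardless of its length.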
We can also convert a list of sentences (here, sentences is a list of strings) by
>>> chartovec_encoder.encode_sentences(sentences, 100, startsig=False, endsig=True, sparse=False)
You can decide whether to output sparse matrices by specifying the parameter sparse.
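As a rough illustration of the difference between the two output forms, the sketch below builds a dense batch with the documented shape (n_sentences, maxlen, num_chars) using nested lists, then lists only the non-zero positions of one sentence, which is the information a sparse matrix stores. The helper names encode_batch_dense and to_sparse_coords are hypothetical, and the truncation and padding rules are assumptions, not the library's code.

```python
def encode_batch_dense(sentences, char_index, maxlen):
    """Return a nested list shaped (n_sentences, maxlen, num_chars)."""
    batch = []
    for sent in sentences:
        matrix = [[0.0] * len(char_index) for _ in range(maxlen)]
        # Assumption: truncate long sentences, leave unused rows as zeros.
        for row, ch in enumerate(sent[:maxlen]):
            col = char_index.get(ch)
            if col is not None:
                matrix[row][col] = 1.0
        batch.append(matrix)
    return batch


def to_sparse_coords(matrix):
    """List the (row, column) positions of the non-zero entries."""
    return [(r, c)
            for r, row in enumerate(matrix)
            for c, value in enumerate(row)
            if value != 0.0]


char_index = {"a": 0, "b": 1, "c": 2}
batch = encode_batch_dense(["ab", "cba"], char_index, maxlen=3)
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)
print(to_sparse_coords(batch[1]))
```

The dense form is convenient for feeding a neural network directly, while the sparse form saves memory when the character set is large, since each row holds at most one non-zero value.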
- class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')[source]
Bases: object
One-hot encoder for character-level text representations.
Converts sentences into one-hot encoded vectors at the character level. Useful for character-level sequence models.
- Reference:
General architecture inspired by char-RNN and related models.
- __init__(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')[source]
Initialize the character vector encoder.
- Args:
dictionary: Gensim Dictionary mapping characters to indices.
signalchar: Signal character for sequence markers. Default: ‘\n’.
- calculate_prelim_vec(sent: str) ndarray[tuple[Any, ...], dtype[float64]][source]
Convert sentence to one-hot character vectors.
- Args:
sent: Input sentence.
- Returns:
One-hot encoded sparse matrix where each row represents a character’s encoding.
- encode_sentence(sent: str, maxlen: int, startsig: bool = False, endsig=False) csc_matrix[source]
Encode a sentence to a sparse character vector matrix.
- Args:
sent: Input sentence to encode.
maxlen: Maximum length of the encoded sequence.
startsig: Whether to prepend signal character. Default: False.
endsig: Whether to append signal character. Default: False.
- Returns:
Sparse matrix representing the sentence with shape (maxlen + startsig + endsig, num_chars).
- encode_sentences(sentences: list[str], maxlen: int, sparse: bool = True, startsig: bool = False, endsig: bool = False) list[ndarray[tuple[Any, ...], dtype[float64]]] | ndarray[tuple[Any, ...], dtype[float64]][source]
Encode multiple sentences into character vectors.
- Args:
sentences: List of sentences to encode.
maxlen: Maximum length for each encoded sentence.
sparse: Whether to return sparse matrices. Default: True.
startsig: Whether to prepend signal character. Default: False.
endsig: Whether to append signal character. Default: False.
- Returns:
If sparse=True: list of sparse matrices. If sparse=False: numpy array of shape (n_sentences, maxlen, num_chars).
- classmethod from_pretrained(textfile: str | PathLike, encoding: bool | None = None) Self[source]
Create a SentenceToCharVecEncoder from a text file.
Builds a character dictionary from the given text file and returns an encoder instance.
- Args:
textfile: Path to the text file for building the character dictionary.
encoding: Encoding of the text file. Default: None.
- Returns:
A SentenceToCharVecEncoder instance.
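The dictionary-building step described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name build_char_index_from_file is hypothetical, a plain dict stands in for the gensim Dictionary, and the sample file is created in a temporary directory just so the example is self-contained.

```python
import os
import tempfile


def build_char_index_from_file(path, encoding=None):
    """Collect every distinct character in a text file and assign it an index."""
    chars = set()
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            chars.update(line)  # adds each character, including the newline
    # Sorting makes the index assignment deterministic in this sketch.
    return {ch: i for i, ch in enumerate(sorted(chars))}


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "corpus.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("crab\ncab\n")
    char_index = build_char_index_from_file(path, encoding="utf-8")

print(char_index)
```

Because the newline character appears in the file, it receives an index like any other character, which is consistent with the signal character being encoded.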
- shorttext.generators.charbase.char2vec.initialize_SentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) SentenceToCharVecEncoder[source]
Deprecated. Use SentenceToCharVecEncoder.from_pretrained instead.
- shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) SentenceToCharVecEncoder[source]
Deprecated. Use initialize_SentenceToCharVecEncoder instead.
Deprecated since version 4.0.0: This will be removed in 4.1.0.
Reference
Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O'Reilly]