Character-Based Sequence-to-Sequence (seq2seq) Models

Since release 0.6.0, shorttext has supported sequence-to-sequence (seq2seq) learning. A general seq2seq class sits underneath; the interface described here is a character-based seq2seq implementation built on top of it.

Creating One-hot Vectors

To use it, create an instance of the class shorttext.generators.SentenceToCharVecEncoder:

>>> import numpy as np
>>> import shorttext
>>> from urllib.request import urlopen
>>> chartovec_encoder = shorttext.generators.SentenceToCharVecEncoder.from_pretrained(urlopen('http://norvig.com/big.txt'))

The above code is the same as in the tutorial Character to One-Hot Vector.

class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')[source]

Bases: object

One-hot encoder for character-level text representations.

Converts sentences into one-hot encoded vectors at the character level. Useful for character-level sequence models.

Reference:: General architecture inspired by char-RNN and related models.

__init__(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')[source]

Initialize the character vector encoder.

Args::
    dictionary: Gensim Dictionary mapping characters to indices.
    signalchar: Signal character for sequence markers. Default: '\n'.

calculate_prelim_vec(sent: str) → ndarray[tuple[Any, ...], dtype[float64]][source]

Convert sentence to one-hot character vectors.

Args:: sent: Input sentence.
Returns:: One-hot encoded sparse matrix where each row represents a character’s encoding.

encode_sentence(sent: str, maxlen: int, startsig: bool = False, endsig=False) → csc_matrix[source]

Encode a sentence to a sparse character vector matrix.

Args::
    sent: Input sentence to encode.
    maxlen: Maximum length of the encoded sequence.
    startsig: Whether to prepend signal character. Default: False.
    endsig: Whether to append signal character. Default: False.
Returns:: Sparse matrix representing the sentence with shape (maxlen + startsig + endsig, num_chars).
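
To make the documented output shape concrete, here is a minimal NumPy sketch of character-level one-hot encoding with optional signal characters. It does not call shorttext: the helper one_hot_sentence and the char_to_index mapping are hypothetical stand-ins, a dense array replaces the sparse matrix, and the truncation and zero-padding behavior is an assumption rather than the library's confirmed behavior.

```python
import numpy as np

def one_hot_sentence(sent, char_to_index, maxlen,
                     startsig=False, endsig=False, signalchar="\n"):
    """Hypothetical helper: dense one-hot matrix with one row per character."""
    # Assumption: truncate to maxlen first, then add the signal characters.
    chars = list(sent[:maxlen])
    if startsig:
        chars.insert(0, signalchar)
    if endsig:
        chars.append(signalchar)
    nrows = maxlen + int(startsig) + int(endsig)
    mat = np.zeros((nrows, len(char_to_index)), dtype=np.float64)
    for row, ch in enumerate(chars):
        col = char_to_index.get(ch)
        if col is not None:  # characters outside the vocabulary stay all-zero
            mat[row, col] = 1.0
    return mat

char_to_index = {"\n": 0, "a": 1, "b": 2, "c": 3}
encoded = one_hot_sentence("abc", char_to_index, maxlen=5, startsig=True, endsig=True)
print(encoded.shape)
```

With maxlen=5 and both signals enabled, the matrix has 5 + 1 + 1 rows and one column per known character; the rows after the end signal remain all-zero padding.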

encode_sentences(sentences: list[str], maxlen: int, sparse: bool = True, startsig: bool = False, endsig: bool = False) → list[ndarray[tuple[Any, ...], dtype[float64]]] | ndarray[tuple[Any, ...], dtype[float64]][source]

Encode multiple sentences into character vectors.

Args::
    sentences: List of sentences to encode.
    maxlen: Maximum length for each encoded sentence.
    sparse: Whether to return sparse matrices. Default: True.
    startsig: Whether to prepend signal character. Default: False.
    endsig: Whether to append signal character. Default: False.
Returns:: If sparse=True: list of sparse matrices. If sparse=False: numpy array of shape (n_sentences, maxlen, num_chars).

__len__() → int[source]: Return the number of unique characters in the dictionary.

classmethod from_pretrained(textfile: str | PathLike, encoding: bool | None = None) → Self[source]

Create a SentenceToCharVecEncoder from a text file.

Builds a character dictionary from the given text file and returns an encoder instance.

Args::
    textfile: Path to the text file for building the character dictionary.
    encoding: Encoding of the text file. Default: None.
Returns:: A SentenceToCharVecEncoder instance.
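
As a rough illustration of what "building a character dictionary" means, the sketch below collects the distinct characters of a text file into a plain dict. According to the signature above, the library stores the mapping in a gensim Dictionary; the helper build_char_index and the use of a plain dict are hypothetical simplifications, not shorttext's implementation.

```python
import os
import tempfile

def build_char_index(textfile, encoding=None):
    """Hypothetical helper: map each distinct character in a file to an index."""
    chars = set()
    with open(textfile, "r", encoding=encoding) as f:
        for line in f:
            chars.update(line)  # adds every character of the line, newline included
    # Sort so that the index assignment is reproducible across runs.
    return {ch: i for i, ch in enumerate(sorted(chars))}

# Usage with a small temporary file standing in for a real corpus
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("abc\ncab\n")
    path = tmp.name
char_index = build_char_index(path, encoding="utf-8")
os.remove(path)
print(char_index)
```

The number of entries in such a mapping corresponds to what __len__() reports for the encoder: the number of unique characters.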

shorttext.generators.charbase.char2vec.initialize_SentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) → SentenceToCharVecEncoder[source]: Deprecated. Use SentenceToCharVecEncoder.from_pretrained instead.

shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) → SentenceToCharVecEncoder[source]: Deprecated. Use initialize_SentenceToCharVecEncoder instead.

Deprecated since version 4.0.0: This will be removed in 4.1.0.

Training

To train a model, first create an instance of shorttext.generators.CharBasedSeq2SeqGenerator, passing the encoder, the number of latent dimensions, and the maximum sentence length:

>>> latent_dim = 100
>>> seq2seqer = shorttext.generators.CharBasedSeq2SeqGenerator(chartovec_encoder, latent_dim, 120)

And then train this neural network model:

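>>> text = urlopen('http://norvig.com/big.txt').read().decode('utf-8')   # assumption: train on the same corpus used for the encoder
>>> seq2seqer.train(text, epochs=100)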

This model takes several hours to train on a laptop.

class shorttext.generators.seq2seq.charbaseS2S.CharBasedSeq2SeqGenerator(sent2charvec_encoder: SentenceToCharVecEncoder, latent_dim: int, maxlen: int)[source]

Bases: CompactIOMachine

Character-based sequence-to-sequence model.

Implements seq2seq at the character level. Uses Seq2SeqWithKeras internally.

Reference:: Oriol Vinyals, Quoc Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). https://arxiv.org/abs/1506.05869

__init__(sent2charvec_encoder: SentenceToCharVecEncoder, latent_dim: int, maxlen: int)[source]

Initialize the generator.

Args::
    sent2charvec_encoder: Character encoder.
    latent_dim: Number of latent dimensions.
    maxlen: Maximum length of a sentence.

compile(optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') → None[source]

Compile the Keras model.

Args::
    optimizer: Optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. Default: rmsprop.
    loss: Loss function from tensorflow.keras. Default: 'categorical_crossentropy'.

prepare_trainingdata(txtseq: str) → tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]][source]

Transform text to numerical vector format.

Args:: txtseq: Input text.
Returns:: Tuple of (encoder_input, decoder_input, decoder_output) as rank-3 tensors.
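
The exact layout of these three tensors is not specified here. As a hedged illustration, the sketch below shows the usual seq2seq teacher-forcing arrangement, in which the decoder output is the decoder input shifted forward by one time step. The helper shifted_decoder_pairs is hypothetical, and whether shorttext arranges its tensors this way is an assumption.

```python
import numpy as np

def shifted_decoder_pairs(target_onehot):
    """Hypothetical helper: derive decoder input and output from one target tensor.

    target_onehot has shape (n_samples, timesteps, n_chars).
    """
    decoder_input = target_onehot[:, :-1, :]   # time steps 0 .. T-2
    decoder_output = target_onehot[:, 1:, :]   # time steps 1 .. T-1
    return decoder_input, decoder_output

# Toy target: 2 samples, 4 time steps, 3 distinct characters
indices = np.array([[0, 1, 2, 0],
                    [0, 2, 1, 0]])
target = np.eye(3)[indices]                    # rank-3 one-hot tensor
dec_in, dec_out = shifted_decoder_pairs(target)
print(target.shape, dec_in.shape, dec_out.shape)
```

At each step the decoder sees the true previous character and is trained to predict the next one, which is why the two tensors overlap in all but one time step.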

train(txtseq: str, batch_size: int = 64, epochs: int = 100, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') → None[source]

Train the character-based seq2seq model.

Args::
    txtseq: Training text.
    batch_size: Batch size. Default: 64.
    epochs: Number of epochs. Default: 100.
    optimizer: Optimizer for gradient descent. Default: rmsprop.
    loss: Loss function from tensorflow.keras. Default: 'categorical_crossentropy'.

decode(txtseq: str, stochastic: bool = True) → str[source]

Generate output text from input text.

Args::
    txtseq: Input text.
    stochastic: Whether to use stochastic sampling. Default: True.
Returns:: Generated output text.

savemodel(prefix: str, final: bool = False) → None[source]

Save the trained model to files.

For compact save, use save_compact_model instead.

Args::
    prefix: Prefix of the file path.
    final: Whether the model is final (cannot be further trained). Default: False.
Raises:: ModelNotTrainedException: If no trained model exists.

loadmodel(prefix: str) → None[source]

Load a trained model from files.

For compact load, use load_compact_model instead.

Args:: prefix: Prefix of the file path.

classmethod from_pretrained(path: str | PathLike, compact: bool = True) → Self[source]

Load a trained CharBasedSeq2SeqGenerator from file.

Args::
    path: Path of the model file.
    compact: Whether to load a compact model. Default: True.
Returns:: CharBasedSeq2SeqGenerator instance for seq2seq inference.

Decoding

After training, the model can be used generatively, for example to answer questions in the manner of a chatbot:

>>> seq2seqer.decode('Happy Holiday!')

The output is not deterministic: by default decode samples stochastically (stochastic=True), so the same input can produce different answers on different calls.
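
The difference between stochastic and deterministic decoding can be sketched for a single character step. The helper pick_next_char below is hypothetical and not part of shorttext; it only contrasts sampling from a predicted probability vector with always taking the most likely character, which is presumably what the stochastic flag of decode switches between.

```python
import numpy as np

def pick_next_char(probs, stochastic=True, rng=None):
    """Hypothetical helper: choose a character index from a probability vector."""
    probs = np.asarray(probs, dtype=np.float64)
    probs = probs / probs.sum()  # renormalize to guard against rounding drift
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        return int(rng.choice(len(probs), p=probs))  # sample from the distribution
    return int(np.argmax(probs))                     # always the most likely index

probs = [0.1, 0.6, 0.3]
greedy = pick_next_char(probs, stochastic=False)
samples = [pick_next_char(probs, rng=np.random.default_rng(seed)) for seed in range(5)]
print(greedy, samples)
```

Greedy selection always returns the same index for the same probabilities, while sampling can return any index with non-zero probability.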

Model I/O

This model can be saved by entering:

>>> seq2seqer.save_compact_model('/path/to/norvigtxt_iter5model.bin')

And can be loaded by:

>>> seq2seqer2 = shorttext.generators.seq2seq.charbaseS2S.CharBasedSeq2SeqGenerator.from_pretrained('/path/to/norvigtxt_iter5model.bin')

shorttext.generators.seq2seq.charbaseS2S.loadCharBasedSeq2SeqGenerator(path: str | PathLike, compact: bool = True) → CharBasedSeq2SeqGenerator[source]: Deprecated. Use CharBasedSeq2SeqGenerator.from_pretrained instead.

Deprecated since version 4.0.1: This will be removed in 5.0.0.

Reference

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O'Reilly]

Ilya Sutskever, James Martens, Geoffrey Hinton, “Generating Text with Recurrent Neural Networks,” ICML (2011). [UToronto]

Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). [arXiv]

Oriol Vinyals, Quoc Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). [arXiv]

Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria, “Recent Trends in Deep Learning Based Natural Language Processing,” arXiv:1708.02709 (2017). [arXiv]

Zachary C. Lipton, John Berkowitz, Charles Elkan, “A Critical Review of Recurrent Neural Networks for Sequence Learning,” arXiv:1506.00019 (2015). [arXiv]