Character-Based Sequence-to-Sequence (seq2seq) Models
Since release 0.6.0, shorttext supports sequence-to-sequence (seq2seq) learning. A general seq2seq class sits underneath, and on top of it the package provides a character-based seq2seq implementation.
Creating One-hot Vectors
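To use it, create an instance of the class shorttext.generators.SentenceToCharVecEncoder. The from_pretrained class method takes the path to a text file, so download the corpus first:
>>> import numpy as np
>>> import shorttext
>>> from urllib.request import urlretrieve
>>> textpath, _ = urlretrieve('http://norvig.com/big.txt', 'big.txt')  # save the corpus to a local file
>>> chartovec_encoder = shorttext.generators.SentenceToCharVecEncoder.from_pretrained(textpath)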
The encoder created above is the same as the one described in Character to One-Hot Vector.
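As a rough illustration of what character-level one-hot encoding means, the sketch below builds a character index from a small corpus and turns a sentence into one row per character. It uses plain Python lists rather than gensim or NumPy, the helper names build_char_index and one_hot_rows are hypothetical, and it is not shorttext's implementation.

```python
def build_char_index(corpus):
    """Map each distinct character in the corpus to a column index (hypothetical helper)."""
    return {ch: i for i, ch in enumerate(sorted(set(corpus)))}


def one_hot_rows(sent, char_index):
    """Return one row per character, with a single 1.0 in that character's column."""
    rows = []
    for ch in sent:
        row = [0.0] * len(char_index)
        if ch in char_index:  # assumption: unknown characters become all-zero rows
            row[char_index[ch]] = 1.0
        rows.append(row)
    return rows


char_index = build_char_index("hello world\n")
vectors = one_hot_rows("hold", char_index)
print(len(vectors), len(vectors[0]))  # number of characters, size of the alphabet
```

Each row has as many columns as there are distinct characters in the corpus, which is why the encoder is built from a large text before any sentence is encoded.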
- class shorttext.generators.charbase.char2vec.SentenceToCharVecEncoder(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')
Bases: object
One-hot encoder for character-level text representations.
Converts sentences into one-hot encoded vectors at the character level. Useful for character-level sequence models.
- Reference:
General architecture inspired by char-RNN and related models.
- __init__(dictionary: gensim.corpora.Dictionary, signalchar: str = '\n')
Initialize the character vector encoder.
- Args:
dictionary: Gensim Dictionary mapping characters to indices.
signalchar: Signal character for sequence markers. Default: '\n'.
- calculate_prelim_vec(sent: str) -> ndarray[tuple[Any, ...], dtype[float64]]
Convert sentence to one-hot character vectors.
- Args:
sent: Input sentence.
- Returns:
One-hot encoded sparse matrix where each row represents a character’s encoding.
- encode_sentence(sent: str, maxlen: int, startsig: bool = False, endsig=False) -> csc_matrix
Encode a sentence to a sparse character vector matrix.
- Args:
sent: Input sentence to encode.
maxlen: Maximum length of the encoded sequence.
startsig: Whether to prepend signal character. Default: False.
endsig: Whether to append signal character. Default: False.
- Returns:
Sparse matrix representing the sentence with shape (maxlen + startsig + endsig, num_chars).
- encode_sentences(sentences: list[str], maxlen: int, sparse: bool = True, startsig: bool = False, endsig: bool = False) -> list[ndarray[tuple[Any, ...], dtype[float64]]] | ndarray[tuple[Any, ...], dtype[float64]]
Encode multiple sentences into character vectors.
- Args:
sentences: List of sentences to encode.
maxlen: Maximum length for each encoded sentence.
sparse: Whether to return sparse matrices. Default: True.
startsig: Whether to prepend signal character. Default: False.
endsig: Whether to append signal character. Default: False.
- Returns:
If sparse=True: list of sparse matrices. If sparse=False: numpy array of shape (n_sentences, maxlen, num_chars).
- classmethod from_pretrained(textfile: str | PathLike, encoding: bool | None = None) -> Self
Create a SentenceToCharVecEncoder from a text file.
Builds a character dictionary from the given text file and returns an encoder instance.
- Args:
textfile: Path to the text file for building the character dictionary.
encoding: Encoding of the text file. Default: None.
- Returns:
A SentenceToCharVecEncoder instance.
- shorttext.generators.charbase.char2vec.initialize_SentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) -> SentenceToCharVecEncoder
Deprecated. Use SentenceToCharVecEncoder.from_pretrained instead.
- shorttext.generators.charbase.char2vec.initSentenceToCharVecEncoder(textfile: str | PathLike, encoding: bool | None = None) -> SentenceToCharVecEncoder
Deprecated. Use initialize_SentenceToCharVecEncoder instead.
Deprecated since version 4.0.0: This will be removed in 4.1.0.
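The shape rule documented for encode_sentence, (maxlen + startsig + endsig, num_chars), can be illustrated with a small stand-alone sketch. The function encode_with_signals below is hypothetical and uses plain lists instead of a sparse matrix. It assumes that sentences longer than maxlen are truncated, that shorter ones are padded with all-zero rows, and that the end signal directly follows the last character; shorttext may place them differently.

```python
def encode_with_signals(sent, maxlen, alphabet, signalchar="\n",
                        startsig=False, endsig=False):
    """Pad or truncate to maxlen characters, optionally adding signal characters.

    Returns a list of one-hot rows with maxlen + startsig + endsig rows.
    Assumption: positions past the end of the text are left as all-zero rows.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    chars = list(sent[:maxlen])
    if startsig:
        chars.insert(0, signalchar)
    if endsig:
        chars.append(signalchar)
    n_rows = maxlen + int(startsig) + int(endsig)
    matrix = [[0.0] * len(alphabet) for _ in range(n_rows)]
    for pos, ch in enumerate(chars):
        if ch in index:
            matrix[pos][index[ch]] = 1.0
    return matrix


alphabet = ["\n", " ", "a", "b", "c"]
matrix = encode_with_signals("abc", maxlen=5, alphabet=alphabet,
                             startsig=True, endsig=True)
print(len(matrix), len(matrix[0]))  # rows = maxlen + startsig + endsig, columns = num_chars
```

Fixing the number of rows this way is what allows many sentences to be stacked into a single array of shape (n_sentences, maxlen, num_chars) when sparse=False.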
Training
Then we can train the model by creating an instance of shorttext.generators.CharBasedSeq2SeqGenerator:
>>> latent_dim = 100
>>> seq2seqer = shorttext.generators.CharBasedSeq2SeqGenerator(chartovec_encoder, latent_dim, 120)
And then train this neural network model, where text is a string holding the training corpus (for example, the contents of big.txt):
>>> seq2seqer.train(text, epochs=100)
This model takes several hours to train on a laptop.
- class shorttext.generators.seq2seq.charbaseS2S.CharBasedSeq2SeqGenerator(sent2charvec_encoder: SentenceToCharVecEncoder, latent_dim: int, maxlen: int)
Bases: CompactIOMachine
Character-based sequence-to-sequence model.
Implements seq2seq at the character level. Uses Seq2SeqWithKeras internally.
- Reference:
Oriol Vinyals, Quoc Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). https://arxiv.org/abs/1506.05869
- __init__(sent2charvec_encoder: SentenceToCharVecEncoder, latent_dim: int, maxlen: int)
Initialize the generator.
- Args:
sent2charvec_encoder: Character encoder.
latent_dim: Number of latent dimensions.
maxlen: Maximum length of a sentence.
- compile(optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') -> None
Compile the Keras model.
- Args:
optimizer: Optimizer for gradient descent. Options: sgd, rmsprop, adagrad, adadelta, adam, adamax, nadam. Default: rmsprop.
loss: Loss function from tensorflow.keras. Default: 'categorical_crossentropy'.
- prepare_trainingdata(txtseq: str) -> tuple[ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]], ndarray[tuple[Any, ...], dtype[float64]]]
Transform text to numerical vector format.
- Args:
txtseq: Input text.
- Returns:
Tuple of (encoder_input, decoder_input, decoder_output) as rank-3 tensors.
- train(txtseq: str, batch_size: int = 64, epochs: int = 100, optimizer: Literal['sgd', 'rmsprop', 'adagrad', 'adadelta', 'adam', 'adamax', 'nadam'] = 'rmsprop', loss: str = 'categorical_crossentropy') -> None
Train the character-based seq2seq model.
- Args:
txtseq: Training text.
batch_size: Batch size. Default: 64.
epochs: Number of epochs. Default: 100.
optimizer: Optimizer for gradient descent. Default: rmsprop.
loss: Loss function from tensorflow.keras. Default: 'categorical_crossentropy'.
- decode(txtseq: str, stochastic: bool = True) -> str
Generate output text from input text.
- Args:
txtseq: Input text.
stochastic: Whether to use stochastic sampling. Default: True.
- Returns:
Generated output text.
- savemodel(prefix: str, final: bool = False) -> None
Save the trained model to files.
For compact save, use save_compact_model instead.
- Args:
prefix: Prefix of the file path.
final: Whether the model is final (cannot be further trained). Default: False.
- Raises:
ModelNotTrainedException: If no trained model exists.
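The three tensors returned by prepare_trainingdata follow the usual seq2seq layout of encoder input, decoder input, and decoder output. A common convention, known as teacher forcing, makes the decoder output the decoder input shifted by one step. The sketch below shows that shift on plain strings. The helper make_teacher_forcing_pair is hypothetical, and whether shorttext builds its tensors in exactly this way is an assumption.

```python
def make_teacher_forcing_pair(target, signalchar="\n"):
    """Build decoder input and output strings for one target sentence.

    The decoder input starts with the signal character; the decoder output
    is the same sequence shifted one step left and ends with the signal.
    """
    decoder_input = signalchar + target
    decoder_output = target + signalchar
    return decoder_input, decoder_output


dec_in, dec_out = make_teacher_forcing_pair("hi there")
# At every position t, the model sees dec_in[t] and should predict dec_out[t].
for seen, expected in zip(dec_in, dec_out):
    print(repr(seen), "->", repr(expected))
```

In a full pipeline each of these strings would then be one-hot encoded per character, giving the rank-3 tensors mentioned above.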
Decoding
After training, the model can be used generatively, for example to answer questions as a chatbot would:
>>> seq2seqer.decode('Happy Holiday!')
The answer varies from call to call because the prediction is sampled stochastically by default (stochastic=True).
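The difference between stochastic and greedy decoding can be sketched at the level of a single step, where the model has produced a probability for each character. The function pick_next_index below is a hypothetical illustration of the two strategies, not the code that decode runs internally.

```python
import random


def pick_next_index(probs, stochastic=True, rng=random):
    """Choose the next character index from a probability distribution."""
    if stochastic:
        # Sample in proportion to the predicted probabilities.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy choice: always take the most likely character.
    return max(range(len(probs)), key=probs.__getitem__)


probs = [0.1, 0.6, 0.3]
print(pick_next_index(probs, stochastic=False))  # index of the largest probability
rng = random.Random(0)
print([pick_next_index(probs, rng=rng) for _ in range(5)])  # depends on the seed
```

Greedy selection returns the same text for the same input every time, while sampling trades that repeatability for more varied output.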
Model I/O
This model can be saved by entering:
>>> seq2seqer.save_compact_model('/path/to/norvigtxt_iter5model.bin')
And can be loaded by:
>>> seq2seqer2 = shorttext.generators.seq2seq.charbaseS2S.CharBasedSeq2SeqGenerator.from_pretrained('/path/to/norvigtxt_iter5model.bin')
- shorttext.generators.seq2seq.charbaseS2S.loadCharBasedSeq2SeqGenerator(path: str | PathLike, compact: bool = True) -> CharBasedSeq2SeqGenerator
Deprecated. Use CharBasedSeq2SeqGenerator.from_pretrained instead.
Deprecated since version 4.0.1: This will be removed in 5.0.0.
Reference
Aurelien Geron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (Sebastopol, CA: O’Reilly Media, 2017). [O'Reilly]
Ilya Sutskever, James Martens, Geoffrey Hinton, “Generating Text with Recurrent Neural Networks,” ICML (2011). [UToronto]
Ilya Sutskever, Oriol Vinyals, Quoc V. Le, “Sequence to Sequence Learning with Neural Networks,” arXiv:1409.3215 (2014). [arXiv]
Oriol Vinyals, Quoc Le, “A Neural Conversational Model,” arXiv:1506.05869 (2015). [arXiv]
Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria, “Recent Trends in Deep Learning Based Natural Language Processing,” arXiv:1708.02709 (2017). [arXiv]
Zachary C. Lipton, John Berkowitz, “A Critical Review of Recurrent Neural Networks for Sequence Learning,” arXiv:1506.00019 (2015). [arXiv]