Introduction
shorttext is a Python package that facilitates supervised and unsupervised learning for short text categorization. Because short texts contain few words and carry little information on their own, an intermediate representation of the texts and documents is needed before they are passed to any classification algorithm. The package supports several types of such representations, including topic modeling and word-embedding algorithms.
The package shorttext runs on Python 3.7, 3.8, and 3.9.
From release 1.0.0, shorttext ran on Python 2.7, 3.5, and 3.6, and release 1.0.7 added Python 3.7. Support for Python 2.7 was decommissioned in release 1.1.7, for Python 3.5 in release 1.2.3, and for Python 3.6 in release 1.5.0.
Characteristics:
- example data provided (including subject keywords and NIH RePORT); (see Data Preparation)
- text preprocessing; (see Text Preprocessing)
- pre-trained word-embedding support; (see Word Embedding Models)
- gensim topic models (LDA, LSI, Random Projections) and autoencoder; (see Supervised Classification with Topics as Features)
- topic model representation supported for supervised learning using scikit-learn; (see Supervised Classification with Topics as Features)
- cosine distance classification; (see Supervised Classification with Topics as Features, Word-Embedding Cosine Similarity Classifier)
- neural network classification (including ConvNet and C-LSTM); (see Deep Neural Networks with Word-Embedding)
- maximum entropy classification; (see Maximum Entropy (MaxEnt) Classifier)
- metrics of phrase differences, including soft Jaccard score (using Damerau-Levenshtein distance) and Word Mover’s distance (WMD); (see Metrics)
- character-level sequence-to-sequence (seq2seq) learning; (see Character-Based Sequence-to-Sequence (seq2seq) Models)
- spell correction; (see Spell Correctors)
- API for one-time loading of word-embedding models; (see Word Embedding Models in API) and
- sentence encodings and similarities based on BERT. (see Word Embedding Models and Metrics)
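To give a feel for one of the metrics listed above, here is a minimal pure-Python sketch of a soft Jaccard score built on Damerau-Levenshtein distance. It is not the shorttext API: the function names `damerau_levenshtein`, `token_similarity`, and `soft_jaccard` are hypothetical, the distance is the optimal-string-alignment variant, and the choices of similarity (one minus the distance divided by the longer token's length) and of greedy one-to-one token matching are assumptions made for illustration. The package's own definition may differ; see Metrics.

```python
from itertools import product


def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def token_similarity(a: str, b: str) -> float:
    """Assumed similarity: 1 - distance / length of the longer token."""
    if not a and not b:
        return 1.0
    return 1.0 - damerau_levenshtein(a, b) / max(len(a), len(b))


def soft_jaccard(tokens1, tokens2) -> float:
    """Soft Jaccard score: the intersection is a sum of fuzzy token matches."""
    # Score every cross pair, then greedily keep the best one-to-one matches.
    scored = sorted(
        (
            (token_similarity(t1, t2), i, j)
            for (i, t1), (j, t2) in product(enumerate(tokens1), enumerate(tokens2))
        ),
        reverse=True,
    )
    used1, used2 = set(), set()
    soft_intersection = 0.0
    for sim, i, j in scored:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            soft_intersection += sim
    soft_union = len(tokens1) + len(tokens2) - soft_intersection
    return soft_intersection / soft_union if soft_union else 1.0


if __name__ == "__main__":
    print(soft_jaccard(["book", "seller"], ["blok", "sellers"]))
```

With exact token matches the score reduces to the ordinary Jaccard index; misspelled or slightly different tokens contribute partial credit instead of zero, which is what makes the score useful for short, noisy phrases.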
Before release 0.7.2, part of the package was implemented in C and interfaced to Python using SWIG (Simplified Wrapper and Interface Generator). Since release 1.0.0, these implementations have been replaced with Cython.
Author: Kwan-Yuet Ho (LinkedIn, ResearchGate, Twitter)
Home: Homepage of shorttext