When applying machine learning algorithms to words or short texts, we usually need to obtain their numeric embedding vectors first.
Some powerful methods use pre-trained deep learning models such as BERT to produce richer, more semantic embeddings. If compute resources are limited, or we simply want a lighter-weight embedding method, we can try TF-IDF metrics instead.
Here we introduce a very simple way to combine character-level n-grams with TF-IDF to convert short texts, such as a few words, into numeric vectors. With these numeric vectors, we can then apply classification methods such as Gradient Boosted Machines for downstream tasks.
First, let's review what an n-gram is.
Quoting the definition from the Wikipedia N-gram page:
…an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application… An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
The two most common types of N-Grams, by far, are (1) character-level, where the items consist of one or more characters and (2) word-level, where the items consist of one or more words. The size of the item (or token as it’s often called) is defined by n;
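To make this concrete: the character-level bigrams (n = 2) of the word "apple" are 'ap', 'pp', 'pl', 'le'. A quick illustrative snippet:

```python
text = 'apple'
# every contiguous pair of characters is one character-level bigram
bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
print(bigrams)  # ['ap', 'pp', 'pl', 'le']
```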
Second, what is the TF-IDF metric?
TF-IDF (Term Frequency - Inverse Document Frequency) encoding is an improvement over bag-of-words (BOW) counts, which correspond to plain TF. It treats terms that appear frequently across many documents as less important.
TF (Term Frequency): counts how many times a term occurs in a document.
IDF (Inverse Document Frequency): the inverse of the number of documents that contain the term.
So TF-IDF is basically the product of the TF and IDF metrics.
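In formula form, for a term t in document d over a corpus of N documents (this is the smoothed variant that scikit-learn's TfidfVectorizer applies by default, followed by L2 normalization of each vector):

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1
$$

where df(t) is the number of documents that contain t.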
It turns out to be very simple to implement character-level n-gram TF-IDF encoding of short texts using the scikit-learn package. This means we can easily incorporate this step into our data processing and feature engineering pipeline.
Step 1: Fit a character n-gram TF-IDF vectorizer on the training data
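A minimal sketch of the fitting step is below. The original training corpus isn't shown, so train_texts is a hypothetical reconstruction (four short phrases chosen to be consistent with the vocabulary printed afterwards); the essential parts are analyzer='char' and ngram_range=(2, 2), which make TfidfVectorizer tokenize each text into character bigrams:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical training corpus, reconstructed to match the printed vocabulary
train_texts = ['great people', 'warm weather', 'feel happy', 'nice work']

# using grams of length 2 (character bigrams) for example
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2))
vectorizer.fit(train_texts)

print(sorted(vectorizer.get_feature_names_out()))
```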
Fitting learns the vocabulary of character bigrams from the training texts:

```
[' h', ' p', ' w', 'ap', 'ar', 'at', 'ce', 'e ', 'ea', 'ee', 'el', 'eo', 'er', 'fe', 'gr', 'ha', 'he', 'ic', 'l ', 'le', 'm ', 'ni', 'op', 'or', 'pe', 'pl', 'pp', 'py', 're', 'rk', 'rm', 't ', 'th', 'wa', 'we', 'wo']
```
Step 2: Transform new text into a TF-IDF weighted vector
```python
new_text = ['pineapple milk']
print(vectorizer.transform(new_text).toarray())
```
```
[[0.         0.         0.         0.42176478 0.         0.
  0.         0.42176478 0.3325242  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.42176478 0.         0.         0.         0.
  0.         0.42176478 0.42176478 0.         0.         0.
  0.         0.         0.         0.         0.         0.        ]]
```
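Only six entries are nonzero: they correspond to the bigrams of 'pineapple milk' that also appear in the fitted vocabulary ('ap', 'e ', 'ea', 'le', 'pl', 'pp'); bigrams never seen during fitting (such as 'mi' or 'lk') are simply ignored. Note that 'ea' gets a smaller weight (0.3325…) than the others because it occurs in more training documents, so its IDF is lower.

As mentioned at the start, these vectors can feed a downstream classifier. Here is a minimal sketch, assuming hypothetical labels for the hypothetical training texts above, chaining the character-bigram TF-IDF encoder with scikit-learn's built-in GradientBoostingClassifier as the GBM:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# hypothetical training texts and labels, for illustration only
train_texts = ['great people', 'warm weather', 'feel happy', 'nice work']
train_labels = ['positive', 'positive', 'positive', 'neutral']

# chain the character-bigram TF-IDF encoder with a GBM classifier
clf = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(2, 2))),
    ('gbm', GradientBoostingClassifier()),
])
clf.fit(train_texts, train_labels)

print(clf.predict(['pineapple milk']))
```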