- The BoW model treats text as a collection of words and disregards the order and grammar of the words.
- It focuses solely on the frequency or presence of words in the given text.
- It is called a "bag" of words because any information about the order or structure of words in the document is discarded.
- The model is only concerned with whether the word occurs in the document, not where in the document.
- Here, we will use the words "tokens", "terms", and "feature names" interchangeably.
- There are different approaches to implementing BoW:
- Counting BoW (CountVectorizer)
  - Uni-Grams
  - N-Grams
  - Character-level N-Grams
  - Binary BoW
- Term Frequency-Inverse Document Frequency (TF-IDF)
- HashingVectorizer
1. Counting BoW (CountVectorizer)
Steps:
- Create the Corpus → Split the text into sentences (or into paragraphs)
- Some Preprocessing (Remove Stop Words, Remove Punctuation, Lowercase)
- Tokenization → Count each word as a token
- Create Frequency Distribution → Count the frequency of each token
- Each element of the vector represents the count or presence of a word from the vocabulary in the document.
Examples:
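The notes below refer to a specific fitted vectorizer; here is a minimal reconstruction (assuming the classic four-sentence corpus from the scikit-learn docs, which produces exactly the 9-word vocabulary and the counts discussed below, with 'document' at index 1):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learns the vocab and returns the doc-term matrix

print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]
print(vectorizer.vocabulary_.get('document'))  # 1
```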

- `fit_transform()`: (1) learns the vocabulary dict, (2) returns the document-term matrix.
- `get_feature_names_out()`: returns the words of our vocabulary (only the words that were in the corpus).
- For efficiency, CountVectorizer returns a sparse representation: `scipy.sparse.csr_matrix`. It saves memory and speeds up algebraic operations.
- `toarray()`: converts from a CSR matrix to a normal (dense) array.
- `vectorizer.vocabulary_.get('document')`: returns the index of that word in the vocabulary, which is `1` as shown in `get_feature_names_out()`.
  - Example: the 2nd sentence has "document" two times, so the count at index `1` is `2`.
Note: In the above example, our vocab is only 9 specific words, so if we try `vectorizer.transform(['Smth completely new']).toarray()`, we get an array of zeros because none of these words are in the vocab.
Note: CountVectorizer requires that each token has 2 or more alphanumeric characters, unless you specify `analyzer='char'` as in Character-level N-Grams.
CountVectorizer Parameters
- `max_features`: build a vocab of only (n) features (tokens); it automatically keeps the top (n) tokens by frequency across the corpus. These features/tokens could be uni-grams, bigrams, or more.
- `lowercase`: default is `True`; converts all characters to lowercase before tokenizing.
- `vocabulary`: the default is `None`, in which case the vocabulary is built from the data itself, but we can pass the tokens we are interested in.
  - If you do not provide an a priori dictionary, the number of features will be equal to the vocabulary size found by analyzing the data.

```python
documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The bird is singing in the tree."
]

# Predefined vocabulary
vectorizer = CountVectorizer(vocabulary=['cat', 'dog', 'bird'])
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Output
# ['cat' 'dog' 'bird']
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]
```
- `max_df`: takes an int (absolute document count) or a float in [0.0, 1.0] (proportion of documents); terms that appear in more documents than this threshold are ignored (useful for corpus-specific stop words).
- `min_df`: the opposite of `max_df`; terms that appear in fewer documents than the threshold are ignored (useful for cutting off rare words and typos).
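A small sketch of `max_df` on a hypothetical toy corpus; with `max_df=0.9`, any term appearing in more than 90% of the documents is dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran", "the bird flew"]

# 'the' appears in 3/3 documents (100% > 90%), so it is dropped from the vocab
vectorizer = CountVectorizer(max_df=0.9)
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# ['bird' 'cat' 'dog' 'flew' 'ran' 'sat']
```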
- `stop_words`: takes a list of words to be ignored; default is `None`. It could be our own list or a prepared list of stop words from nltk:

```python
# !pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

english_stop_words = stopwords.words('english')
vectorizer = CountVectorizer(stop_words=english_stop_words)
```
- `inverse_transform`: takes the document-term matrix and returns, for each document, the vocabulary terms that are present in it (the original word order and duplicate counts are not recovered).
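A minimal sketch of `inverse_transform` (small hypothetical corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Each row of X maps back to the vocabulary terms present in that document
print(vectorizer.inverse_transform(X)[0])
# ['document' 'first' 'is' 'the' 'this'] -- terms come back in vocabulary
# order; word order and duplicate counts are not restored
```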
1.b. N-grams
- N-Grams capture the co-occurrence of consecutive words in a document.
- Instead of representing individual words as tokens, N-Grams represent sequences of N consecutive words as tokens.
- Bigrams represent pairs of words, trigrams represent triplets of words, and so on.
- This approach captures local word order and can help capture some level of context.
- If we have the sentence "The cat sat on the mat" and we consider bigrams (N=2), the N-Grams would be "The cat," "cat sat," "sat on," "on the," and "the mat."
- Similarly, for trigrams (N=3), the N-Grams would be "The cat sat," "cat sat on," "sat on the," and "on the mat."
- The choice of the N value (e.g., bigrams, trigrams, etc.) depends on the specific task and the desired level of context to be captured.
- Bigrams tend to focus on immediate word relationships, while trigrams and four-grams capture more complex relationships.
- `ngram_range`: the default is (1, 1), which means only uni-grams.
  - (1, 2) means uni-grams and bigrams
  - (2, 2) means bigrams only
  - (1, 3) means uni-grams, bigrams, and trigrams

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The bird is singing in the tree."
]

# (2, 3) extracts bigrams and trigrams, matching the output below
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(documents)
print(list(vectorizer.get_feature_names_out()))
print(X.toarray())

# Output
# ['bird is', 'bird is singing', 'cat sat', 'cat sat on', 'dog jumped',
#  'dog jumped over', 'in the', 'in the tree', 'is singing', 'is singing in',
#  'jumped over', 'jumped over the', 'on the', 'on the mat', 'over the',
#  'over the fence', 'sat on', 'sat on the', 'singing in', 'singing in the',
#  'the bird', 'the bird is', 'the cat', 'the cat sat', 'the dog',
#  'the dog jumped', 'the fence', 'the mat', 'the tree']
# [[0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0]
#  [0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0]
#  [1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1]]
```
1.c. Character-level N-Grams
- TBH: Not useful → Instead of considering words, character-level N-Grams tokenize the text into characters (or short character sequences).

```python
# analyzer='char' tokenizes into single characters instead of words;
# combine it with ngram_range to get character n-grams
vectorizer = CountVectorizer(analyzer='char')
```

1.d. Binary BoW
- TBH: Not useful → In this approach, the presence or absence of a word in a document is represented by a binary value (0 or 1).
- No consideration of the frequency.
```python
documents = [
    "I like to play soccer as I am strong",
    "I enjoy playing soccer with my close close friends"
]

# binary=True records presence/absence (0/1) instead of raw counts:
# note 'close' appears twice in the 2nd document but is encoded as 1
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Output
# ['am' 'as' 'close' 'enjoy' 'friends' 'like' 'my' 'play' 'playing' 'soccer' 'strong' 'to' 'with']
# [[1 1 0 0 0 1 0 1 0 1 1 1 0]
#  [0 0 1 1 1 0 1 0 1 1 0 0 1]]
```

Disadvantages of Counting BoW
- Order Independence: BoW disregards the order and grammar of words in the text, which means it cannot capture sequential information or word dependencies.
  - Two sentences with different meanings have the same representation: "The food was good, not bad at all" & "The food was bad, not good at all" (see the sketch after this list).
- Information Loss: BoW focuses solely on word frequencies; it has no knowledge of word semantics or linguistics.
  - Ex: "strong" & "sturdy" are not close in distance (vector space); each word simply gets its own independent dimension with a count of 1.
  - All words are equally distant from each other → for example, the word "sturdy" is equally distant to "VA" and "strong".
- Vocabulary Size (High Dimensionality): If we have 20 unique words, then the vector length is 20. Imagine having 5000 unique words!
  - High-dimensional feature vectors bring potential memory and computational challenges.
- Traditional neural networks cannot keep track of long-term dependencies in a sentence/paragraph.
- Despite its simplicity and limitations, Counting BoW remains a widely used technique in NLP.
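A quick sketch of the order-independence problem referenced in the list above: both sentences produce identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The food was good, not bad at all",
    "The food was bad, not good at all",
]

X = CountVectorizer().fit_transform(docs).toarray()
# Same multiset of words => same BoW vector, despite opposite meanings
print((X[0] == X[1]).all())  # True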
2. Term Frequency-Inverse Document Frequency (TF-IDF)
- TF-IDF is a numerical representation that reflects the importance of a term in a document within a corpus.
Term Frequency (TF): It measures the frequency of a term within a document. The intuition behind TF is that the more frequently a term appears in a document, the more relevant & indicative it is to that document.
Inverse Document Frequency (IDF): It measures the rarity of a term across the entire corpus. The intuition behind IDF is that terms that are less frequent among documents are more discriminative and provide more meaningful information compared to common terms.
IDF has no effect on retrieving documents for one-term queries → IDF affects the ranking of documents for queries with at least two terms.
TF-IDF is the product of two statistics: term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics.
Steps:
- Create the Corpus → Split the text into sentences (or into paragraphs)
- Some Preprocessing (Remove Stop Words, Remove Punctuation, Lowercase)
- Tokenization → Count each word as a token
- TF Calculation → For each document, the TF of each token is calculated.
- IDF Calculation → IDF is calculated for each token across the entire corpus.
- TF-IDF Calculation → The TF-IDF score for each token in each document is computed.
- Vector Normalization → Each output vector is given a unit norm using the L2 norm, to ensure that the resulting vector representations have consistent scales.
- It helps eliminate the potential biases introduced by document length variations.
- This normalization process allows for fair comparisons between vectors, focusing on the direction of the vectors rather than their magnitudes (document length variations).
- Overall, Vector Normalization allows for fair and meaningful comparisons between documents, facilitating various text analysis tasks such as similarity search, document clustering, and classification.
Some Notations:
Term Frequency
- Raw Term Counting: TF(t, d) = f(t, d), the raw count of term "t" in document "d".
- Term Frequency with Normalization: TF(t, d) = \frac{f(t, d)}{\sum_{t' \in d} f(t', d)}
  - We divide by the number of words in (d) so that the value is normalized → a value between 0 and 1.
  - This normalization accounts for variations in document lengths and provides a relative measure of the frequency of the term within the document.
  - Example: If the term frequency of the same word, 'computer', is 1 million for doc1 and 2 million for doc2, at that scale there is not much difference in terms of relevancy anymore, because both contain a very high count for 'computer' → using normalization, they would be mostly similar.
- Logarithmic Term Frequency: TF(t, d) = \log(1 + f(t, d))
  - This helps to dampen the impact of highly frequent terms and reduces the skewness caused by a few dominant terms.
  - Here, f(t, d) represents the raw count of the term "t" in document "d"; the logarithmic transformation is applied to this count.
- Augmented Term Frequency: TF(t, d) = 0.5 + 0.5 \cdot \frac{f(t, d)}{\max_{t' \in d} f(t', d)}
  - Here, f(t, d) represents the raw count of the term "t" in document "d". The maximum frequency of any term in the document, \max_{t' \in d} f(t', d), is calculated and used to normalize the term frequency.
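A small plain-Python sketch of the four TF variants above (log base 10 assumed, to match the worked TF-IDF example later):

```python
import math
from collections import Counter

doc = "the cat sat on the mat near the other cat".split()

counts = Counter(doc)               # raw term counting: f(t, d)
n_words = sum(counts.values())      # number of words in d
max_f = max(counts.values())        # frequency of the most frequent term in d

for t, f in sorted(counts.items()):
    tf_norm = f / n_words           # normalized TF, in [0, 1]
    tf_log = math.log10(1 + f)      # logarithmic TF
    tf_aug = 0.5 + 0.5 * f / max_f  # augmented TF, in [0.5, 1]
    print(f"{t:5s} f={f} norm={tf_norm:.3f} log={tf_log:.3f} aug={tf_aug:.3f}")
```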
Inverse Document Frequency
- Standard IDF: IDF(t) = \log\left(\frac{N}{df(t)}\right)
  - We want a bigger number if a word is mentioned in only a few documents, because that means it is important to us.
    - If N=10 and the word is mentioned in this document only → higher value (for example, 10/1 = 10)
    - If N=10 and the word is mentioned in 7 documents → lower value (for example, 10/7 = 1.43)
  - So, we agree that it is good to divide the Total Number of Documents in the corpus (N) by the number of documents that contain the term (df).
  - The standard IDF just applies a logarithm to this ratio, for the same reasons as in TF.
    - By taking the logarithm, the IDF values are compressed, meaning that the differences between IDF values for different terms become more balanced.
    - If N=10 and the word is mentioned in this document only → higher value (for example, log(10/1) = 1)
    - If N=10 and the word is mentioned in 7 documents → lower value (for example, log(10/7) = 0.155)
    - This compression helps prevent a few terms with very low document frequencies (and therefore huge N/df ratios) from dominating the IDF scores → it narrows the gap.
  - What if a term appears in all documents? Then log(1) = 0, and TF-IDF will be zero although TF is not → so we just add 1 rather than ignoring the term totally: IDF(t) = \log\left(\frac{N}{df(t)}\right) + 1
- Smooth IDF: IDF(t) = \log\left(\frac{1 + N}{1 + df(t)}\right) + 1
  - It is solely for the case of terms that appear in no documents → we want to avoid division by 0.
  - Why would a term not appear in any document? It could be that we only have the term in an a priori vocabulary!
  - We added 1 to the numerator and denominator as if there were an extra document that contains every term in the corpus exactly once.
- Maximum IDF: The maximum IDF function assigns a constant value to all terms, regardless of their document frequency. It does not consider the actual document frequencies of terms. The formula for maximum IDF is as follows: IDF(t) = \log(N)
The intuition behind maximum IDF is to treat all terms equally, assuming that their rarity or importance is not influenced by their occurrence in the corpus. It simplifies the IDF calculation and can be useful in certain scenarios.
- Probabilistic IDF: The probabilistic IDF function estimates how informative a term is from the odds that a randomly chosen document does not contain it. The formula for probabilistic IDF is as follows: IDF(t) = \log\left(\frac{N - df(t) + 0.5}{df(t) + 0.5}\right)
  - The 0.5 terms smooth the estimate for very rare or very common terms; this is also the IDF component used in the BM25 ranking function.
  - The intuition behind probabilistic IDF is to capture the relevance of a term by weighing how many documents contain it against how many do not, providing a more nuanced measure of term importance.
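A companion sketch for the IDF variants above (log base 10, N=10, mirroring the running example):

```python
import math

N = 10                          # total documents in the corpus
df = {"rare": 1, "common": 7}   # number of documents containing each term

for t, d in df.items():
    idf_standard = math.log10(N / d) + 1
    idf_smooth = math.log10((1 + N) / (1 + d)) + 1
    idf_prob = math.log10((N - d + 0.5) / (d + 0.5))
    print(f"{t:6s} standard={idf_standard:.3f} "
          f"smooth={idf_smooth:.3f} prob={idf_prob:.3f}")
```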
TF-IDF Final Equation
TF-IDF(t, d) = TF(t, d) \cdot IDF(t)
In the example below, TF is the normalized term frequency and IDF is the standard IDF with the added 1.
Example
Q: A document has the following frequencies for three terms: f("A", d) = 3, f("B", d) = 2, f("C", d) = 1. The corpus is 1000 documents, and we have df('A') = 50, df('B') = 1300, df('C') = 250. What is the TF-IDF vector for this document? (Note: df('B') > N cannot happen in a real corpus; the number is purely illustrative and makes the log go negative.)
- For Term "A" → `TF-IDF("A") = (3/6) * (log(1000/50) + 1) = 1.15`
- For Term "B" → `TF-IDF("B") = (2/6) * (log(1000/1300) + 1) = 0.295` (the log is negative here; the added 1 keeps the score positive)
- For Term "C" → `TF-IDF("C") = (1/6) * (log(1000/250) + 1) = 0.267`
- Then our TF-IDF vector (embedding) is `[1.15, 0.295, 0.267]`
- If we want to do L2 Normalization, the norm is `sqrt(1.15² + 0.295² + 0.267²) = 1.217` → `[1.15/1.217, 0.295/1.217, 0.267/1.217] = [0.945, 0.243, 0.219]`
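A short sketch verifying the arithmetic above (log base 10):

```python
import math

N = 1000
f = {"A": 3, "B": 2, "C": 1}          # term frequencies in the document
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the corpus
n_words = sum(f.values())             # 6

tfidf = [(f[t] / n_words) * (math.log10(N / df[t]) + 1) for t in ("A", "B", "C")]
norm = math.sqrt(sum(x * x for x in tfidf))

print([round(x, 3) for x in tfidf])         # [1.151, 0.295, 0.267]
print(round(norm, 3))                       # 1.217
print([round(x / norm, 3) for x in tfidf])  # [0.945, 0.243, 0.219]
```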
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I enjoy playing soccer.",
    "Soccer is a popular sport worldwide.",
    "Football, also known as soccer, is played in many countries.",
]

vectorizer = TfidfVectorizer()

# Fit and transform the documents into TF-IDF vectors
tfidf_vectors = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()  # feature_names = tokens

# Print the TF-IDF vectors
for i, document in enumerate(documents):
    print(f"Document {i+1}:")
    for j, feature in enumerate(feature_names):
        tfidf_score = tfidf_vectors[i, j]
        if tfidf_score > 0:  # print only tokens that exist in this document
            print(f"{feature}: {tfidf_score:.4f}")
    print()

# Output
# Document 1:
# enjoy: 0.6525
# playing: 0.6525
# soccer: 0.3854
# Document 2:
# is: 0.3838
# popular: 0.5046
# soccer: 0.2980
# sport: 0.5046
# worldwide: 0.5046
# Document 3:
# also: 0.3347
# as: 0.3347
# countries: 0.3347
# football: 0.3347
# in: 0.3347
# is: 0.2545
# known: 0.3347
# many: 0.3347
# played: 0.3347
# soccer: 0.1977
```

- `TfidfVectorizer` is a composition of `CountVectorizer` followed by `TfidfTransformer`.
- `CountVectorizer` creates a matrix that corresponds to the count of each token in each document.
- `TfidfTransformer`, on the other hand, is a transformer that takes the count matrix produced by `CountVectorizer` and transforms it into a TF-IDF representation.
  - It calculates the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term-document pair.
Historically, the combination of CountVectorizer and TfidfTransformer was commonly used in the scikit-learn library for text analysis tasks. The process involved fitting the CountVectorizer on the training data to build the vocabulary and generate the count matrix. Then, the TfidfTransformer was applied to the count matrix to compute the final TF-IDF representation.
To simplify this workflow and improve efficiency, scikit-learn introduced the TfidfVectorizer class, which combines both (see the equivalence sketch below).
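A minimal sketch of that equivalence, assuming default parameters on both sides:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = [
    "I enjoy playing soccer.",
    "Soccer is a popular sport worldwide.",
]

# Two-step pipeline: counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```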
TfidfVectorizer Parameters
- We have access to all parameters of CountVectorizer, such as `ngram_range`, `analyzer`, `lowercase`, `stop_words`, `max_df`, `min_df`, `max_features`, `vocabulary`.
- `sublinear_tf`: default is `False`. If `True`, TF is replaced with 1 + log(TF) → explained above in Logarithmic Term Frequency.
- `use_idf`: default is `True`. If `False`, then IDF(t) = 1 and we only consider TF-IDF(t, d) = TF(t, d) → it disables IDF weighting.
- `smooth_idf`: default is `True`, which gives IDF(t) = \log\left(\frac{1 + N}{1 + df(t)}\right) + 1 → explained above in Smooth IDF. If `False`, the Standard IDF \log\left(\frac{N}{df(t)}\right) + 1 is used.
- `norm`: default is `'l2'`, but we can use `'l1'` or `None` → explained in detail in Vector Normalization.
3. HashingVectorizer
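A minimal sketch of scikit-learn's HashingVectorizer (default-style usage; it hashes tokens straight into a fixed number of columns, so it stores no vocabulary, needs no fit, and cannot inverse_transform):

```python
from sklearn.feature_extraction.text import HashingVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
]

# n_features fixes the output dimensionality up front (default is 2**20);
# different tokens may collide in the same column
vectorizer = HashingVectorizer(n_features=2**10)
X = vectorizer.transform(documents)  # stateless: no fit/vocabulary needed
print(X.shape)  # (2, 1024)
```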
BM25
Latent semantic analysis
Latent Dirichlet allocation
- Rule-based features: hand-crafted features based on linguistic rules (e.g., "is_capitalized", "is_numeric", "ends_with_ed").
- Lexicons: using predefined lists of words (e.g., sentiment lexicons) to assign scores.
