- The BoW model treats text as a collection of words and disregards the order and grammar of the words.
- It focuses solely on the frequency or presence of words in the given text.
- It is called a "bag" of words because any information about the order or structure of words in the document is discarded.
- The model is only concerned with whether the word occurs in the document, not where in the document.
- Here, we will use the words "tokens", "terms", and "feature names" interchangeably.
- There are different approaches to implementing BoW:
- Counting BoW (CountVectorizer)
  - Uni-Grams
  - N-Grams
  - Character-level N-Grams
  - Binary BoW
- Term Frequency-Inverse Document Frequency (TF-IDF)
- HashingVectorizer
1. Counting BoW (CountVectorizer)
Steps:
- Create the Corpus → Split the text into sentences (or into paragraphs)
- Some Preprocessing (Remove Stop Words, Remove Punctuation, Lowercase)
- Tokenization → Count each word as a token
- Create Frequency Distribution → Count the frequency of each token
- Each element of the vector represents the count or presence of a word from the vocabulary in the document.
Examples:
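The notes below refer to a specific fitted vectorizer; here is a minimal reconstruction (assuming the classic four-sentence corpus from the scikit-learn docs, which produces exactly the 9-word vocabulary and the counts discussed below, with 'document' at index 1):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learns the vocab and returns the doc-term matrix

print(vectorizer.get_feature_names_out())
# ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
print(X.toarray())
# [[0 1 1 1 0 0 1 0 1]
#  [0 2 0 1 0 1 1 0 1]
#  [1 0 0 1 1 0 1 1 1]
#  [0 1 1 1 0 0 1 0 1]]
print(vectorizer.vocabulary_.get('document'))  # 1
```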

- `fit_transform()`: (1) learns the vocabulary dict, (2) returns the document-term matrix.
- `get_feature_names_out()`: returns the words of our vocabulary (only the words that were in the corpus).
- For efficiency, CountVectorizer returns a sparse representation: `scipy.sparse.csr_matrix`. It saves memory and speeds up algebraic operations.
- `toarray()`: converts from a CSR matrix to a normal (dense) array.
- `vectorizer.vocabulary_.get('document')`: returns the index of that word in the vocabulary, which is `1` as shown in `get_feature_names_out()`.
  - Example: the 2nd sentence has "document" two times, so the count at index `1` is `2`.
Note: In the above example, our vocab is only 9 specific words, so if we try `vectorizer.transform(['Smth completely new']).toarray()`, we get an array of zeros because none of these words are in the vocab.
Note: CountVectorizer requires that each token has 2 or more alphanumeric characters, unless you specify `analyzer='char'` as in Character-level N-Grams.
CountVectorizer Parameters
- `max_features`: build a vocab of only (n) features (tokens); it automatically keeps the top (n) tokens by frequency across the corpus. These features/tokens could be uni-grams, bigrams, or more.
- `lowercase`: default is `True`; converts all characters to lowercase before tokenizing.
- `vocabulary`: the default is `None`, in which case the vocabulary is built from the data itself, but we can pass the tokens we are interested in.
  - If you do not provide an a priori dictionary, the number of features will be equal to the vocabulary size found by analyzing the data.

```python
documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The bird is singing in the tree."
]

# Predefined vocabulary
vectorizer = CountVectorizer(vocabulary=['cat', 'dog', 'bird'])
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Output
# ['cat' 'dog' 'bird']
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]
```
- `max_df`: takes an int (absolute document count) or a float in [0.0, 1.0] (proportion of documents); terms that appear in more documents than this threshold are ignored (useful for corpus-specific stop words).
- `min_df`: the opposite of `max_df`; terms that appear in fewer documents than the threshold are ignored (useful for cutting off rare words and typos).
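A small sketch of `max_df` on a hypothetical toy corpus; with `max_df=0.9`, any term appearing in more than 90% of the documents is dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog ran", "the bird flew"]

# 'the' appears in 3/3 documents (100% > 90%), so it is dropped from the vocab
vectorizer = CountVectorizer(max_df=0.9)
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())
# ['bird' 'cat' 'dog' 'flew' 'ran' 'sat']
```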
- `stop_words`: takes a list of words to be ignored; default is `None`. It could be our own list or a prepared list of stop words from nltk:

```python
# !pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

english_stop_words = stopwords.words('english')
vectorizer = CountVectorizer(stop_words=english_stop_words)
```
- `inverse_transform`: takes the document-term matrix and returns, for each document, the vocabulary terms that are present in it (the original word order and duplicate counts are not recovered).
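A minimal sketch of `inverse_transform` (small hypothetical corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Each row of X maps back to the vocabulary terms present in that document
print(vectorizer.inverse_transform(X)[0])
# ['document' 'first' 'is' 'the' 'this'] -- terms come back in vocabulary
# order; word order and duplicate counts are not restored
```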
1.b. N-grams
- N-Grams capture the co-occurrence of consecutive words in a document.
- Instead of representing individual words as tokens, N-Grams represent sequences of N consecutive words as tokens.
- Bigrams represent pairs of words, trigrams represent triplets of words, and so on.
- This approach captures local word order and can help capture some level of context.
- If we have the sentence "The cat sat on the mat" and we consider bigrams (N=2), the N-Grams would be "The cat," "cat sat," "sat on," "on the," and "the mat."
- Similarly, for trigrams (N=3), the N-Grams would be "The cat sat," "cat sat on," "sat on the," and "on the mat."
- The choice of the N value (e.g., bigrams, trigrams, etc.) depends on the specific task and the desired level of context to be captured.
- Bigrams tend to focus on immediate word relationships, while trigrams and four-grams capture more complex relationships.
- `ngram_range`: the default is (1, 1), which means only uni-grams.
  - (1, 2) means uni-grams and bigrams
  - (2, 2) means bigrams only
  - (1, 3) means uni-grams, bigrams, and trigrams

```python
from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
    "The bird is singing in the tree."
]

# (2, 3) extracts bigrams and trigrams, matching the output below
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(documents)
print(list(vectorizer.get_feature_names_out()))
print(X.toarray())

# Output
# ['bird is', 'bird is singing', 'cat sat', 'cat sat on', 'dog jumped',
#  'dog jumped over', 'in the', 'in the tree', 'is singing', 'is singing in',
#  'jumped over', 'jumped over the', 'on the', 'on the mat', 'over the',
#  'over the fence', 'sat on', 'sat on the', 'singing in', 'singing in the',
#  'the bird', 'the bird is', 'the cat', 'the cat sat', 'the dog',
#  'the dog jumped', 'the fence', 'the mat', 'the tree']
# [[0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0]
#  [0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 0]
#  [1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1]]
```
1.c. Character-level N-Grams
- TBH: Not useful → Instead of considering words, character-level N-Grams tokenize the text into characters (or short character sequences).

```python
# analyzer='char' tokenizes into single characters instead of words;
# combine it with ngram_range to get character n-grams
vectorizer = CountVectorizer(analyzer='char')
```

1.d. Binary BoW
- TBH: Not useful → In this approach, the presence or absence of a word in a document is represented by a binary value (0 or 1).
- No consideration of the frequency.
```python
documents = [
    "I like to play soccer as I am strong",
    "I enjoy playing soccer with my close close friends"
]

# binary=True records presence/absence (0/1) instead of raw counts:
# note 'close' appears twice in the 2nd document but is encoded as 1
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(X.toarray())

# Output
# ['am' 'as' 'close' 'enjoy' 'friends' 'like' 'my' 'play' 'playing' 'soccer' 'strong' 'to' 'with']
# [[1 1 0 0 0 1 0 1 0 1 1 1 0]
#  [0 0 1 1 1 0 1 0 1 1 0 0 1]]
```

Disadvantages of Counting BoW
- Order Independence: BoW disregards the order and grammar of words in the text, which means it cannot capture sequential information or word dependencies.
  - Two sentences with different meanings have the same representation: "The food was good, not bad at all" & "The food was bad, not good at all" (see the sketch after this list).
- Information Loss: BoW focuses solely on word frequencies; it has no knowledge of word semantics or linguistics.
  - Ex: "strong" & "sturdy" are not close in distance (vector space); each word simply gets its own independent dimension with a count of 1.
  - All words are equally distant from each other → for example, the word "sturdy" is equally distant to "VA" and "strong".
- Vocabulary Size (High Dimensionality): If we have 20 unique words, then the vector length is 20. Imagine having 5000 unique words!
  - High-dimensional feature vectors bring potential memory and computational challenges.
- Traditional neural networks cannot keep track of long-term dependencies in a sentence/paragraph.
- Despite its simplicity and limitations, Counting BoW remains a widely used technique in NLP.
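A quick sketch of the order-independence problem referenced in the list above: both sentences produce identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The food was good, not bad at all",
    "The food was bad, not good at all",
]

X = CountVectorizer().fit_transform(docs).toarray()
# Same multiset of words => same BoW vector, despite opposite meanings
print((X[0] == X[1]).all())  # True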
2. Term Frequency-Inverse Document Frequency (TF-IDF)
- TF-IDF is a numerical representation that reflects the importance of a term in a document within a corpus.
Term Frequency (TF): It measures the frequency of a term within a document. The intuition behind TF is that the more frequently a term appears in a document, the more relevant & indicative it is to that document.
Inverse Document Frequency (IDF): It measures the rarity of a term across the entire corpus. The intuition behind IDF is that terms that are less frequent among documents are more discriminative and provide more meaningful information compared to common terms.
IDF has no effect on retrieving documents for one-term queries → IDF affects the ranking of documents for queries with at least two terms.
TF-IDF is the product of two statistics: term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics.
Steps:
- Create the Corpus → Split the text into sentences (or into paragraphs)
- Some Preprocessing (Remove Stop Words, Remove Punctuation, Lowercase)
- Tokenization → Count each word as a token
- TF Calculation → For each document, the TF of each token is calculated.
- IDF Calculation → IDF is calculated for each token across the entire corpus.
- TF-IDF Calculation → The TF-IDF score for each token in each document is computed.
- Vector Normalization → Each output vector is given a unit norm using the L2 norm, to ensure that the resulting vector representations have consistent scales.
- It helps eliminate the potential biases introduced by document length variations.
- This normalization process allows for fair comparisons between vectors, focusing on the direction of the vectors rather than their magnitudes (document length variations).
- Overall, Vector Normalization allows for fair and meaningful comparisons between documents, facilitating various text analysis tasks such as similarity search, document clustering, and classification.
Some Notations:
Term Frequency
- Raw Term Counting: TF(t, d) = f(t, d), the raw count of term "t" in document "d".
- Term Frequency with Normalization: TF(t, d) = \frac{f(t, d)}{\sum_{t' \in d} f(t', d)}
  - We divide by the number of words in (d) so that the value is normalized → a value between 0 and 1.
  - This normalization accounts for variations in document lengths and provides a relative measure of the frequency of the term within the document.
  - Example: If the term frequency of the same word, 'computer', is 1 million for doc1 and 2 million for doc2, at that scale there is not much difference in terms of relevancy anymore, because both contain a very high count for 'computer' → using normalization, they would be mostly similar.
- Logarithmic Term Frequency: TF(t, d) = \log(1 + f(t, d))
  - This helps to dampen the impact of highly frequent terms and reduces the skewness caused by a few dominant terms.
  - Here, f(t, d) represents the raw count of the term "t" in document "d"; the logarithmic transformation is applied to this count.
- Augmented Term Frequency: TF(t, d) = 0.5 + 0.5 \cdot \frac{f(t, d)}{\max_{t' \in d} f(t', d)}
  - Here, f(t, d) represents the raw count of the term "t" in document "d". The maximum frequency of any term in the document, \max_{t' \in d} f(t', d), is calculated and used to normalize the term frequency.
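A small plain-Python sketch of the four TF variants above (log base 10 assumed, to match the worked TF-IDF example later):

```python
import math
from collections import Counter

doc = "the cat sat on the mat near the other cat".split()

counts = Counter(doc)               # raw term counting: f(t, d)
n_words = sum(counts.values())      # number of words in d
max_f = max(counts.values())        # frequency of the most frequent term in d

for t, f in sorted(counts.items()):
    tf_norm = f / n_words           # normalized TF, in [0, 1]
    tf_log = math.log10(1 + f)      # logarithmic TF
    tf_aug = 0.5 + 0.5 * f / max_f  # augmented TF, in [0.5, 1]
    print(f"{t:5s} f={f} norm={tf_norm:.3f} log={tf_log:.3f} aug={tf_aug:.3f}")
```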
Inverse Document Frequency
- Standard IDF: IDF(t) = \log\left(\frac{N}{df(t)}\right)
  - We want a bigger number if a word is mentioned in only a few documents, because that means it is important to us.
    - If N=10 and the word is mentioned in this document only → higher value (for example, 10/1 = 10)
    - If N=10 and the word is mentioned in 7 documents → lower value (for example, 10/7 = 1.43)
  - So, we agree that it is good to divide the Total Number of Documents in the corpus (N) by the number of documents that contain the term (df).
  - The standard IDF just applies a logarithm to this ratio, for the same reasons as in TF.
    - By taking the logarithm, the IDF values are compressed, meaning that the differences between IDF values for different terms become more balanced.
    - If N=10 and the word is mentioned in this document only → higher value (for example, log(10/1) = 1)
    - If N=10 and the word is mentioned in 7 documents → lower value (for example, log(10/7) = 0.155)
    - This compression helps prevent a few terms with very low document frequencies (and therefore huge N/df ratios) from dominating the IDF scores → it narrows the gap.
  - What if a term appears in all documents? Then log(1) = 0, and TF-IDF will be zero although TF is not → so we just add 1 rather than ignoring the term totally: IDF(t) = \log\left(\frac{N}{df(t)}\right) + 1
- Smooth IDF: IDF(t) = \log\left(\frac{1 + N}{1 + df(t)}\right) + 1
  - It is solely for the case of terms that appear in no documents → we want to avoid division by 0.
  - Why would a term not appear in any document? It could be that we only have the term in an a priori vocabulary!
  - We added 1 to the numerator and denominator as if there were an extra document that contains every term in the corpus exactly once.
- Maximum IDF: The maximum IDF function assigns a constant value to all terms, regardless of their document frequency. It does not consider the actual document frequencies of terms. The formula for maximum IDF is as follows: IDF(t) = \log(N)
The intuition behind maximum IDF is to treat all terms equally, assuming that their rarity or importance is not influenced by their occurrence in the corpus. It simplifies the IDF calculation and can be useful in certain scenarios.
- Probabilistic IDF: The probabilistic IDF function estimates how informative a term is from the odds that a randomly chosen document does not contain it. The formula for probabilistic IDF is as follows: IDF(t) = \log\left(\frac{N - df(t) + 0.5}{df(t) + 0.5}\right)
  - The 0.5 terms smooth the estimate for very rare or very common terms; this is also the IDF component used in the BM25 ranking function.
  - The intuition behind probabilistic IDF is to capture the relevance of a term by weighing how many documents contain it against how many do not, providing a more nuanced measure of term importance.
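A companion sketch for the IDF variants above (log base 10, N=10, mirroring the running example):

```python
import math

N = 10                          # total documents in the corpus
df = {"rare": 1, "common": 7}   # number of documents containing each term

for t, d in df.items():
    idf_standard = math.log10(N / d) + 1
    idf_smooth = math.log10((1 + N) / (1 + d)) + 1
    idf_prob = math.log10((N - d + 0.5) / (d + 0.5))
    print(f"{t:6s} standard={idf_standard:.3f} "
          f"smooth={idf_smooth:.3f} prob={idf_prob:.3f}")
```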
TF-IDF Final Equation
TF-IDF(t, d) = TF(t, d) \cdot IDF(t)
In the example below, TF is the normalized term frequency and IDF is the standard IDF with the added 1.
Example
Q: A document has the following frequencies for three terms: f("A", d) = 3, f("B", d) = 2, f("C", d) = 1. The corpus is 1000 documents, and we have df('A') = 50, df('B') = 1300, df('C') = 250. What is the TF-IDF vector for this document? (Note: df('B') > N cannot happen in a real corpus; the number is purely illustrative and makes the log go negative.)
- For Term "A" → `TF-IDF("A") = (3/6) * (log(1000/50) + 1) = 1.15`
- For Term "B" → `TF-IDF("B") = (2/6) * (log(1000/1300) + 1) = 0.295` (the log is negative here; the added 1 keeps the score positive)
- For Term "C" → `TF-IDF("C") = (1/6) * (log(1000/250) + 1) = 0.267`
- Then our TF-IDF vector (embedding) is `[1.15, 0.295, 0.267]`
- If we want to do L2 Normalization, the norm is `sqrt(1.15² + 0.295² + 0.267²) = 1.217` → `[1.15/1.217, 0.295/1.217, 0.267/1.217] = [0.945, 0.243, 0.219]`
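A short sketch verifying the arithmetic above (log base 10):

```python
import math

N = 1000
f = {"A": 3, "B": 2, "C": 1}          # term frequencies in the document
df = {"A": 50, "B": 1300, "C": 250}   # document frequencies in the corpus
n_words = sum(f.values())             # 6

tfidf = [(f[t] / n_words) * (math.log10(N / df[t]) + 1) for t in ("A", "B", "C")]
norm = math.sqrt(sum(x * x for x in tfidf))

print([round(x, 3) for x in tfidf])         # [1.151, 0.295, 0.267]
print(round(norm, 3))                       # 1.217
print([round(x / norm, 3) for x in tfidf])  # [0.945, 0.243, 0.219]
```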
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "I enjoy playing soccer.",
    "Soccer is a popular sport worldwide.",
    "Football, also known as soccer, is played in many countries.",
]

vectorizer = TfidfVectorizer()

# Fit and transform the documents into TF-IDF vectors
tfidf_vectors = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()  # feature_names = tokens

# Print the TF-IDF vectors
for i, document in enumerate(documents):
    print(f"Document {i+1}:")
    for j, feature in enumerate(feature_names):
        tfidf_score = tfidf_vectors[i, j]
        if tfidf_score > 0:  # print only tokens that exist in this document
            print(f"{feature}: {tfidf_score:.4f}")
    print()

# Output
# Document 1:
# enjoy: 0.6525
# playing: 0.6525
# soccer: 0.3854
# Document 2:
# is: 0.3838
# popular: 0.5046
# soccer: 0.2980
# sport: 0.5046
# worldwide: 0.5046
# Document 3:
# also: 0.3347
# as: 0.3347
# countries: 0.3347
# football: 0.3347
# in: 0.3347
# is: 0.2545
# known: 0.3347
# many: 0.3347
# played: 0.3347
# soccer: 0.1977
```

- `TfidfVectorizer` is a composition of `CountVectorizer` followed by `TfidfTransformer`.
- `CountVectorizer` creates a matrix that corresponds to the count of each token in each document.
- `TfidfTransformer`, on the other hand, is a transformer that takes the count matrix produced by `CountVectorizer` and transforms it into a TF-IDF representation.
  - It calculates the Term Frequency-Inverse Document Frequency (TF-IDF) values for each term-document pair.
Historically, the combination of CountVectorizer and TfidfTransformer was commonly used in the scikit-learn library for text analysis tasks. The process involved fitting the CountVectorizer on the training data to build the vocabulary and generate the count matrix. Then, the TfidfTransformer was applied to the count matrix to compute the final TF-IDF representation.
To simplify this workflow and improve efficiency, scikit-learn introduced the TfidfVectorizer class, which combines both (see the equivalence sketch below).
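A minimal sketch of that equivalence, assuming default parameters on both sides:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

docs = [
    "I enjoy playing soccer.",
    "Soccer is a popular sport worldwide.",
]

# Two-step pipeline: counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step equivalent
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```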
TfidfVectorizer Parameters
- We have access to all parameters of CountVectorizer, such as `ngram_range`, `analyzer`, `lowercase`, `stop_words`, `max_df`, `min_df`, `max_features`, `vocabulary`.
- `sublinear_tf`: default is `False`. If `True`, TF is replaced with 1 + log(TF) → explained above in Logarithmic Term Frequency.
- `use_idf`: default is `True`. If `False`, then IDF(t) = 1 and we only consider TF-IDF(t, d) = TF(t, d) → it disables IDF weighting.
- `smooth_idf`: default is `True`, which gives IDF(t) = \log\left(\frac{1 + N}{1 + df(t)}\right) + 1 → explained above in Smooth IDF. If `False`, the Standard IDF \log\left(\frac{N}{df(t)}\right) + 1 is used.
- `norm`: default is `'l2'`, but we can use `'l1'` or `None` → explained in detail in Vector Normalization.
3. HashingVectorizer
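A minimal sketch of scikit-learn's HashingVectorizer (default-style usage; it hashes tokens straight into a fixed number of columns, so it stores no vocabulary, needs no fit, and cannot inverse_transform):

```python
from sklearn.feature_extraction.text import HashingVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog jumped over the fence.",
]

# n_features fixes the output dimensionality up front (default is 2**20);
# different tokens may collide in the same column
vectorizer = HashingVectorizer(n_features=2**10)
X = vectorizer.transform(documents)  # stateless: no fit/vocabulary needed
print(X.shape)  # (2, 1024)
```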
BM25
Latent semantic analysis
Latent Dirichlet allocation
- Rule-based features: hand-crafted features based on linguistic rules (e.g., "is_capitalized", "is_numeric", "ends_with_ed").
- Lexicons: using predefined lists of words (e.g., sentiment lexicons) to assign scores.
