
Text Embeddings [1]

Last updated: 11/10/2025

Intro

  • Imagine being able to capture the essence of any text (a tweet, a document, or a book) in a single, compact representation. This is the power of embedding models, which lie at the heart of many retrieval systems.
  • Embedding models translate human language (and other media formats) into a format that machines can understand and compare with speed and accuracy.
  • Over the past decade, the landscape of embedding models has evolved dramatically. Early techniques (pre-2013) like Bag of Words and TF-IDF provided simple, sparse representations, but lacked nuance and context.
  • A pivotal moment came in 2018 when Google introduced BERT, a transformer model that embeds text as a vector representation, which led to unprecedented performance across various NLP tasks.
  • However, BERT wasn't optimized for generating sentence embeddings efficiently. This limitation spurred the creation of SBERT (Sentence-BERT) in 2019.
  • Today, the embedding model ecosystem is diverse and rapidly expanding, with numerous providers offering their own implementations.
  • Researchers and practitioners often turn to benchmarks like the Massive Text Embedding Benchmark (MTEB) Leaderboard for objective comparisons.
  • You can use embeddings to compare different texts and understand how they relate. For example, if the embeddings of the texts "cat" and "dog" are close together, you can infer that these words are similar in meaning, context, or both, as sketched below.
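A minimal sketch of this comparison using the sentence-transformers library (the model choice here is illustrative, not one this note prescribes):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Any sentence-embedding model works here; all-MiniLM-L6-v2 is a small,
# widely used choice
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["cat", "dog", "airplane"])

# Related concepts land closer together in the vector space
print(cos_sim(embeddings[0], embeddings[1]))  # cat vs. dog      -> higher
print(cos_sim(embeddings[0], embeddings[2]))  # cat vs. airplane -> lower
```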

Embedding Applications

Embeddings are useful because they let us reason about text in a vector space and do things like semantic search, where we look for the pieces of text that are most similar in that space.

[1] Semantic Search & Information Retrieval (Including RAG):

[2] Clustering & Categorization & Topic Modeling:

[3] Text Feature Encoder for Machine Learning Algorithms (e.g., Sentiment Analysis):

[4] Zero-Shot Classification (see the sketch after this list):

[5] Paraphrase & Duplicate Detection (Text Similarity)

[6] Recommendations

[7] Anomaly Detection:

[8] Data Visualization in 2D/3D:

[9] Tracking Semantic Shift Over Time:

[10] Cross-Modal Retrieval and Generation (Including OpenAI CLIP):
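As a quick illustration of application [4], a hedged sketch of zero-shot classification via embeddings: embed the candidate labels once, then assign each text to the nearest label (the model and labels below are assumptions for the example):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

labels = ["sports", "politics", "technology"]
text = "The new GPU doubles training throughput."

# No task-specific training: classification is just nearest-label search
label_embs = model.encode(labels)
text_emb = model.encode(text)

scores = cos_sim(text_emb, label_embs)[0]
print(labels[int(scores.argmax())])  # expected: "technology"
```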

💡
Since there are many applications, some embedding models (like Gemini) let you choose a Task Type when you create an embedding, as shown below.
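For example, with Google's google-generativeai SDK the task type is a parameter on the embedding call (a sketch; the model name and task type are illustrative):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# task_type tailors the embedding to the application, e.g.
# "retrieval_document", "retrieval_query", "clustering", "classification"
result = genai.embed_content(
    model="models/text-embedding-004",
    content="Embeddings capture the essence of text.",
    task_type="retrieval_document",
)
print(len(result["embedding"]))
```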
🔒
Note: The applications mentioned above are sentence-embedding applications. Token-embedding applications are discussed in a separate note.

Similarity Measures

  • You can find all details about them here:
🔑
There is a wide range of possible similarity measures but the most commonly used one when talking about sentence embeddings is cosine similarity.
🔒
This is why cosine similarity is so important: by normalizing for vector length, it cancels out unintended norm inflation.
  • The similarity metric should be chosen based on the model. For example, OpenAI suggests cosine similarity for its embeddings.
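A minimal NumPy definition, showing how dividing by both norms makes vector length drop out:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product scaled by both norms, so vector length drops out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 10 * a))  # ~1.0: inflating the norm changes nothing
```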

Matryoshka Representation Learning (MRL)

  • The Trick: During standard training, a loss function (like contrastive loss) is usually applied only to the final, full-size embedding. In MRL, the loss function is applied to multiple distinct "slices" of the embedding vector simultaneously during training.
  • If training a 768-dimensional model, the training process might simultaneously optimize the loss for standard slices like:
    • The first 64 dimensions
    • The first 128 dimensions
    • The first 256 dimensions
    • The full 768 dimensions
  • Because the model is forced to perform well even when only using the first k dimensions, it learns to pack the most general and important concepts into those early dimensions.
  • At inference time, you still get the full 768-dimensional vector. However, if you need to save storage or bandwidth, you can safely truncate that vector to one of the supported smaller sizes (e.g., taking just vector[:128]) and still maintain high accuracy.
  • The "Matryoshka" idea is that you can pick a dimension size "nested" inside the full dimension, like Russian dolls.
๐Ÿ‘Œ๐Ÿป
You cannot say, I will use a lower dimension output size and save on computation. Itโ€™s the same calculations. You are just saving on storage.

Choosing a smaller embedding dimension doesn't reduce the cost of the forward pass inside the model itself, but it can significantly speed up everything after embedding generation.
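A hedged sketch of truncating an MRL embedding at inference time (the 768/128 sizes are illustrative):

```python
import numpy as np

def truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    # Keep the first `dim` dimensions, then re-normalize so cosine /
    # dot-product comparisons stay well calibrated
    small = emb[:dim]
    return small / np.linalg.norm(small)

full = np.random.randn(768)        # stand-in for a full MRL embedding
full /= np.linalg.norm(full)

small = truncate(full, 128)        # 6x less storage per vector
print(small.shape)                 # (128,)
```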

Implementation Details of MRL

  • MRL is generally model-agnostic. It can be applied to BERT, ViT (Vision Transformers), or ResNets.
  • The key to MRL's success lies in how gradients naturally accumulate during backpropagation.
  • Later Dimensions (e.g., 513 to 768): These dimensions only contribute to the loss term for m = 768. They receive standard, sparse gradient signals.
  • Early Dimensions (e.g., 1 to 64): These dimensions contribute to every single loss term in the summation (they are present in the 64d slice, the 128d slice, etc.).
๐Ÿ‘Œ๐Ÿป
Because the early dimensions are being forced to satisfy multiple objectives simultaneously, capturing enough information to minimize loss on their own, while also supporting larger vectors โ€”> The optimization process naturally packs the most salient, high-level semantic information into these first few dimensions.
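A hedged PyTorch-style sketch of this multi-slice objective (the pairwise loss and slice sizes are stand-ins, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [64, 128, 256, 768]  # nested slice sizes (illustrative)

def matryoshka_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                    labels: torch.Tensor) -> torch.Tensor:
    # emb_a, emb_b: (batch, 768) embedding pairs; labels: 1 = similar, 0 = not.
    # The same pairwise loss is applied to every nested slice, so the early
    # dimensions receive gradients from every term in the sum.
    total = torch.tensor(0.0)
    for m in MATRYOSHKA_DIMS:
        a = F.normalize(emb_a[:, :m], dim=-1)
        b = F.normalize(emb_b[:, :m], dim=-1)
        sim = (a * b).sum(dim=-1)              # cosine similarity per pair
        total = total + F.mse_loss(sim, labels.float())
    return total
```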
✅
Performance: MRL is remarkably robust: at low dimensions it often matches or exceeds separately trained small models, and at large dimensions it matches the "full-size" baseline.

For example, on ImageNet, MRL can deliver up to 14x smaller embeddings with no loss in accuracy.

  • Interesting article: https://huggingface.co/blog/matryoshka
  • Another use case for MRL is shortlisting and reranking:
    • Rather than performing your downstream task (e.g., nearest-neighbor search for RAG) on the full embeddings, you can shrink the embeddings to a smaller size and very efficiently "shortlist" your embeddings. Afterwards, you can process the remaining embeddings using their full dimensionality.

We still need to finish the HF article and read the details of how this implementation happens in code and math.

The primary technical benefit in production is Adaptive Retrieval, or "funneling":

  1. First Pass (Shortlisting): Perform standard Approximate Nearest Neighbor (ANN) search using only the smallest useful dimensions (e.g., the first 128 dims). This reduces memory bandwidth and distance-calculation time significantly.
  2. Second Pass (Reranking): Fetch the full 768d vectors only for the top-k candidates retrieved in the first pass.
  3. Rescore: Compute precise similarities using the full vectors to determine the final ranking.
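A hedged NumPy sketch of this funnel (brute-force scoring stands in for a real ANN index, and all sizes are illustrative):

```python
import numpy as np

def adaptive_retrieval(query: np.ndarray, corpus: np.ndarray,
                       shortlist_dim: int = 128, shortlist_k: int = 100,
                       final_k: int = 10) -> np.ndarray:
    """Two-pass 'funnel' search over MRL embeddings (rows are unit vectors)."""
    # Pass 1: cheap scoring with truncated (and re-normalized) vectors
    q_small = query[:shortlist_dim] / np.linalg.norm(query[:shortlist_dim])
    c_small = corpus[:, :shortlist_dim]
    c_small = c_small / np.linalg.norm(c_small, axis=1, keepdims=True)
    rough = c_small @ q_small
    shortlist = np.argsort(-rough)[:shortlist_k]

    # Pass 2: precise rescoring of the shortlist with full-size vectors
    exact = corpus[shortlist] @ query
    return shortlist[np.argsort(-exact)[:final_k]]

corpus = np.random.randn(10_000, 768)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[42] + 0.1 * np.random.randn(768)
query /= np.linalg.norm(query)
print(adaptive_retrieval(query, corpus))  # doc 42 should rank near the top
```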

Sentence Embedding Models

Nowadays, there are many embedding models to choose from.

Model2Vec

What is Model2vec?

Model2vec is a lightweight, high-speed embedding framework designed to generate static word embeddings from transformer models.

It is optimized for CPU-only inference and can run up to 500× faster than traditional transformer-based embedding methods.
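A minimal usage sketch with the model2vec library (the model name is the public potion-base-8M; the exact API may differ by version):

```python
from model2vec import StaticModel

# Load a distilled static model from the Hugging Face Hub
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Encoding is a table lookup plus pooling, so it runs fast on CPU
embeddings = model.encode(["Semantic search on a budget.",
                           "Static embeddings are fast."])
print(embeddings.shape)  # (2, embedding_dim)
```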


Key Features (from Confluence Docs)

Performance

  • ⚡ Blazing fast inference speed (CPU-only)
  • 📏 No max token limit (unlike traditional transformers)
  • 🧩 Parallelizable with Spark (scales across clusters)
  • 🚀 Up to 500× faster than other embedding methods

Technical Details

  • Version: model2vec==0.4.1 (in production)
  • Available Models:
    • potion-8M (8M params) → MTEB score: 38.6
    • potion-base-8M → HuggingFace recommended
    • Larger options: 32M & 128M parameters

Use Cases in Production

  1. 🔍 Semantic Search → Context-based discovery
  2. 🧠 Psychographics Analysis → Tagging 800M docs by groups
  3. 📊 Large-Scale Processing → Social media at Databricks
  4. ⏱ Real-time Apps → Prioritizing speed over full accuracy

Advantages

  • 💲 Cost-effective (free)
  • 🖥 CPU-only → no GPU required
  • 📡 Scales with distributed systems (e.g., Spark)
  • ⚙️ Quick deployment

Current Applications

  • Context Area System → Millions of docs processed
  • Social Media Analysis → Embedding large datasets
  • Psychographic Segmentation → Demographic embeddings
  • Content Scoring → Pipeline for content analysis

Limitations

  • 📉 Lower MTEB scores (38.6 vs. higher-end transformers)
  • 🔀 Quality trade-off (speed > semantic nuance)
  • 🧪 Still under evaluation (teams testing OpenAI embeddings, Google Vertex AI, etc.)

Combined Understanding

Model2vec balances speed vs. quality.

While not achieving the deepest semantic accuracy, its extreme efficiency makes it ideal for:

  • Large-scale systems (millions of docs)
  • Real-time/near-real-time embedding
  • CPU-only deployments

Static vs. Dynamic Embeddings

Static Embeddings (Model2vec)

  • One fixed vector per word/token
  • Context-independent (e.g., "bank" → same vector for finance or river)
  • Pre-computed once → stored in lookup table
  • O(1) lookup → no heavy computation
  • Memory-efficient

Dynamic/Contextual Embeddings (Transformers)

  • Context-dependent (different vectors depending on usage)
  • Computed on-demand (full forward pass each time)
  • Slower inference
  • GPU/compute-heavy
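A toy sketch of the contrast (the vectors are made up): the static table returns the same "bank" vector regardless of context, while a contextual model would re-compute it per sentence.

```python
import numpy as np

# Toy static lookup table: one fixed vector per token, precomputed offline
vocab = {
    "bank": np.array([0.8, 0.1]),
    "river": np.array([0.1, 0.9]),
    "money": np.array([0.9, 0.2]),
}

def embed_static(sentence: str) -> np.ndarray:
    # O(1) lookup per token + mean pooling; no forward pass, no context
    vectors = [vocab[tok] for tok in sentence.split() if tok in vocab]
    return np.mean(vectors, axis=0)

# "bank" contributes the exact same vector in both sentences; a contextual
# model would instead produce different vectors for the two usages
print(embed_static("money bank"))
print(embed_static("river bank"))
```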

Why Model2vec Uses Static Embeddings

  • ⚡ Speed (lookup vs. transformer run)
  • 📈 Scalability (millions of docs)
  • 🛠 Deployment simplicity (CPU infra only)
  • ⏱ Predictable, consistent latency

How the Static Vector is Chosen (e.g., "bank")

Aggregation Process

  1. Context Sampling → Collect many examples of "bank" (financial, river, verb, etc.)
  2. Vector Averaging/Pooling → Combine contextual embeddings into one centroid vector (sketched after this list)
  3. Methods Used:
    • Mean pooling (most common)
    • Weighted averaging (frequency bias)
    • Subword aggregation (sometimes)
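A hedged sketch of the aggregation step (the contextual vectors below are random stand-ins for real transformer outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for contextual embeddings of the token "bank" extracted from a
# transformer over many sampled sentences (financial, river, verb, ...)
contextual_bank = rng.normal(size=(1000, 768))

# Mean pooling: every sampled context counts equally
static_bank_mean = contextual_bank.mean(axis=0)

# Weighted averaging: if most sampled usages are financial, the centroid
# skews toward the financial sense
weights = rng.dirichlet(np.ones(1000))
static_bank_weighted = weights @ contextual_bank

print(static_bank_mean.shape, static_bank_weighted.shape)  # (768,) (768,)
```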

Effects

  • Captures average meaning across contexts
  • Skews toward most frequent usage (e.g., "financial bank")

Limitations

  • โŒ No disambiguation at inference time
  • โŒ Muddled vectors for polysemous words
  • โŒ Less effective for nuanced semantics

Model2vec's Distillation Approach

  • Start with a large transformer (e.g., BERT)
  • Extract contextual embeddings across many contexts
  • Compress into static lookup tables
  • Trade semantic nuance for massive speed gains
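A rough, illustrative distillation loop using Hugging Face transformers (not model2vec's exact recipe, which also adds steps such as dimensionality reduction and frequency re-weighting):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # illustrative teacher model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

static_table = {}
for word in ["bank", "river", "money"]:          # in practice: the full vocab
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    # Mean-pool over subword positions to get one static vector per word
    static_table[word] = out.mean(dim=1).squeeze(0).numpy()

print(static_table["bank"].shape)  # (768,)
```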