
Text Embeddings [1]

Last updated: 11/10/2025

Intro

  • Imagine being able to capture the essence of any text (a tweet, a document, or a book) in a single, compact representation. This is the power of embedding models, which lie at the heart of many retrieval systems.
  • Embedding models translate human language (and other media formats) into a format that machines can understand and compare with speed and accuracy.
  • Over the past decade, the landscape of embedding models has evolved dramatically. Early techniques (pre-2013) like Bag of Words and TF-IDF provided simple, sparse representations, but lacked nuance and context.
  • A pivotal moment came in 2018 when Google introduced BERT, a transformer model that embeds text as a vector representation, which led to unprecedented performance across various NLP tasks.
  • However, BERT wasn't optimized for generating sentence embeddings efficiently. This limitation spurred the creation of SBERT (Sentence-BERT) in 2019.
  • Today, the embedding model ecosystem is diverse and rapidly expanding, with numerous providers offering their own implementations.
  • Researchers and practitioners often turn to benchmarks like the Massive Text Embedding Benchmark (MTEB) Leaderboard for objective comparisons.
  • You can use embeddings to compare different texts and understand how they relate. For example, if the embeddings of the texts "cat" and "dog" are close together, you can infer that these words are similar in meaning, context, or both, as sketched below.
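A minimal sketch of this comparison using the sentence-transformers library (the model choice here is illustrative, not one this note prescribes):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Any sentence-embedding model works here; all-MiniLM-L6-v2 is a small,
# widely used choice
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(["cat", "dog", "airplane"])

# Related concepts land closer together in the vector space
print(cos_sim(embeddings[0], embeddings[1]))  # cat vs. dog      -> higher
print(cos_sim(embeddings[0], embeddings[2]))  # cat vs. airplane -> lower
```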

Embedding Applications

Embeddings are useful because they let us reason about text in a vector space and do things like semantic search, where we look for the pieces of text that are most similar in that space.

[1] Semantic Search & Information Retrieval (Including RAG):

[2] Clustering & Categorization & Topic Modeling:

[3] Text Feature Encoder for Machine Learning Algorithms (e.g., Sentiment Analysis):

[4] Zero-Shot Classification (see the sketch after this list):

[5] Paraphrase & Duplicate Detection (Text Similarity)

[6] Recommendations

[7] Anomaly Detection:

[8] Data Visualization in 2D/3D:

[9] Tracking Semantic Shift Over Time:

[10] Cross-Modal Retrieval and Generation (Including OpenAI CLIP):
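As a quick illustration of application [4], a hedged sketch of zero-shot classification via embeddings: embed the candidate labels once, then assign each text to the nearest label (the model and labels below are assumptions for the example):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

labels = ["sports", "politics", "technology"]
text = "The new GPU doubles training throughput."

# No task-specific training: classification is just nearest-label search
label_embs = model.encode(labels)
text_emb = model.encode(text)

scores = cos_sim(text_emb, label_embs)[0]
print(labels[int(scores.argmax())])  # expected: "technology"
```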

💡
Since there are many applications, some embedding models (like Gemini) let you choose a Task Type when you create an embedding, as shown below.
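For example, with Google's google-generativeai SDK the task type is a parameter on the embedding call (a sketch; the model name and task type are illustrative):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# task_type tailors the embedding to the application, e.g.
# "retrieval_document", "retrieval_query", "clustering", "classification"
result = genai.embed_content(
    model="models/text-embedding-004",
    content="Embeddings capture the essence of text.",
    task_type="retrieval_document",
)
print(len(result["embedding"]))
```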
🔒
Note: The applications mentioned above are sentence-embedding applications. Token-embedding applications are discussed in a separate note.

Similarity Measures

  • You can find all details about them here:
🔑
There is a wide range of possible similarity measures but the most commonly used one when talking about sentence embeddings is cosine similarity.
🔒
This is why cosine similarity is so important: by normalizing for vector length, it cancels out unintended norm inflation.
  • The similarity metric should be chosen based on the model. For example, OpenAI suggests cosine similarity for its embeddings.
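A minimal NumPy definition, showing how dividing by both norms makes vector length drop out:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product scaled by both norms, so vector length drops out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 10 * a))  # ~1.0: inflating the norm changes nothing
```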

Matryoshka Representation Learning (MRL)

  • The Trick: During standard training, a loss function (like contrastive loss) is usually applied only to the final, full-size embedding. In MRL, the loss function is applied to multiple distinct "slices" of the embedding vector simultaneously during training.
  • If training a 768-dimensional model, the training process might simultaneously optimize the loss for standard slices like:
    • The first 64 dimensions
    • The first 128 dimensions
    • The first 256 dimensions
    • The full 768 dimensions
  • Because the model is forced to perform well even when only using the first k dimensions, it learns to pack the most general and important concepts into those early dimensions.
  • At inference time, you still get the full 768-dimensional vector. However, if you need to save storage or bandwidth, you can safely truncate that vector to one of the supported smaller sizes (e.g., taking just vector[:128]) and still maintain high accuracy.
  • The "Matryoshka" idea is that you can pick a dimension size "nested" inside the full dimension, like Russian dolls.
๐Ÿ‘Œ๐Ÿป
You cannot say, I will use a lower dimension output size and save on computation. Itโ€™s the same calculations. You are just saving on storage.

Choosing a smaller embedding dimension doesn't reduce the cost of the forward pass inside the model itself, but it can significantly speed up everything after embedding generation.
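A hedged sketch of truncating an MRL embedding at inference time (the 768/128 sizes are illustrative):

```python
import numpy as np

def truncate(emb: np.ndarray, dim: int) -> np.ndarray:
    # Keep the first `dim` dimensions, then re-normalize so cosine /
    # dot-product comparisons stay well calibrated
    small = emb[:dim]
    return small / np.linalg.norm(small)

full = np.random.randn(768)        # stand-in for a full MRL embedding
full /= np.linalg.norm(full)

small = truncate(full, 128)        # 6x less storage per vector
print(small.shape)                 # (128,)
```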

Implementation Details of MRL

  • MRL is generally model-agnostic. It can be applied to BERT, ViT (Vision Transformers), or ResNets.
  • The key to MRL's success lies in how gradients naturally accumulate during backpropagation.
  • Later Dimensions (e.g., 513 to 768): These dimensions only contribute to the loss term for m = 768. They receive standard, sparse gradient signals.
  • Early Dimensions (e.g., 1 to 64): These dimensions contribute to every single loss term in the summation (they are present in the 64d slice, the 128d slice, etc.).
๐Ÿ‘Œ๐Ÿป
Because the early dimensions are being forced to satisfy multiple objectives simultaneously, capturing enough information to minimize loss on their own, while also supporting larger vectors โ€”> The optimization process naturally packs the most salient, high-level semantic information into these first few dimensions.
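A hedged PyTorch-style sketch of this multi-slice objective (the pairwise loss and slice sizes are stand-ins, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

MATRYOSHKA_DIMS = [64, 128, 256, 768]  # nested slice sizes (illustrative)

def matryoshka_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                    labels: torch.Tensor) -> torch.Tensor:
    # emb_a, emb_b: (batch, 768) embedding pairs; labels: 1 = similar, 0 = not.
    # The same pairwise loss is applied to every nested slice, so the early
    # dimensions receive gradients from every term in the sum.
    total = torch.tensor(0.0)
    for m in MATRYOSHKA_DIMS:
        a = F.normalize(emb_a[:, :m], dim=-1)
        b = F.normalize(emb_b[:, :m], dim=-1)
        sim = (a * b).sum(dim=-1)              # cosine similarity per pair
        total = total + F.mse_loss(sim, labels.float())
    return total
```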
✅
Performance: MRL is remarkably robust: at low dimensions it often matches or exceeds separately trained small models, and at large dimensions it matches the "full-size" baseline.

For example, on ImageNet, MRL can deliver up to 14x smaller embeddings with no loss in accuracy.

  • Interesting article: https://huggingface.co/blog/matryoshka
  • Another use case for MRL is shortlisting and reranking:
    • Rather than performing your downstream task (e.g., nearest-neighbor search for RAG) on the full embeddings, you can shrink the embeddings to a smaller size and very efficiently "shortlist" your embeddings. Afterwards, you can process the remaining embeddings using their full dimensionality.

We still need to finish the HF article and read the details of how this implementation happens in code and math.

The primary technical benefit in production is Adaptive Retrieval, or "funneling":

  1. First Pass (Shortlisting): Perform standard Approximate Nearest Neighbor (ANN) search using only the smallest useful dimensions (e.g., the first 128 dims). This reduces memory bandwidth and distance-calculation time significantly.
  2. Second Pass (Reranking): Fetch the full 768d vectors only for the top-k candidates retrieved in the first pass.
  3. Rescore: Compute precise similarities using the full vectors to determine the final ranking.
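A hedged NumPy sketch of this funnel (brute-force scoring stands in for a real ANN index, and all sizes are illustrative):

```python
import numpy as np

def adaptive_retrieval(query: np.ndarray, corpus: np.ndarray,
                       shortlist_dim: int = 128, shortlist_k: int = 100,
                       final_k: int = 10) -> np.ndarray:
    """Two-pass 'funnel' search over MRL embeddings (rows are unit vectors)."""
    # Pass 1: cheap scoring with truncated (and re-normalized) vectors
    q_small = query[:shortlist_dim] / np.linalg.norm(query[:shortlist_dim])
    c_small = corpus[:, :shortlist_dim]
    c_small = c_small / np.linalg.norm(c_small, axis=1, keepdims=True)
    rough = c_small @ q_small
    shortlist = np.argsort(-rough)[:shortlist_k]

    # Pass 2: precise rescoring of the shortlist with full-size vectors
    exact = corpus[shortlist] @ query
    return shortlist[np.argsort(-exact)[:final_k]]

corpus = np.random.randn(10_000, 768)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = corpus[42] + 0.1 * np.random.randn(768)
query /= np.linalg.norm(query)
print(adaptive_retrieval(query, corpus))  # doc 42 should rank near the top
```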

Sentence Embedding Models

Nowadays, there are many embedding models to choose from.

Model2Vec

What is Model2vec?

Model2vec is a lightweight, high-speed embedding framework designed to generate static word embeddings from transformer models.

It is optimized for CPU-only inference and can run up to 500× faster than traditional transformer-based embedding methods.
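A minimal usage sketch with the model2vec library (the model name is the public potion-base-8M; the exact API may differ by version):

```python
from model2vec import StaticModel

# Load a distilled static model from the Hugging Face Hub
model = StaticModel.from_pretrained("minishlab/potion-base-8M")

# Encoding is a table lookup plus pooling, so it runs fast on CPU
embeddings = model.encode(["Semantic search on a budget.",
                           "Static embeddings are fast."])
print(embeddings.shape)  # (2, embedding_dim)
```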


Key Features (from Confluence Docs)

Performance

  • ⚡ Blazing fast inference speed (CPU-only)
  • 📏 No max token limit (unlike traditional transformers)
  • 🧩 Parallelizable with Spark (scales across clusters)
  • 🚀 Up to 500× faster than other embedding methods

Technical Details

  • Version: model2vec==0.4.1 (in production)
  • Available Models:
    • potion-8M (8M params) → MTEB score: 38.6
    • potion-base-8M → HuggingFace recommended
    • Larger options: 32M & 128M parameters

Use Cases in Production

  1. 🔍 Semantic Search → Context-based discovery
  2. 🧠 Psychographics Analysis → Tagging 800M docs by groups
  3. 📊 Large-Scale Processing → Social media at Databricks
  4. ⏱ Real-time Apps → Prioritizing speed over full accuracy

Advantages

  • 💲 Cost-effective (free)
  • 🖥 CPU-only → no GPU required
  • 📡 Scales with distributed systems (e.g., Spark)
  • ⚙️ Quick deployment

Current Applications

  • Context Area System → Millions of docs processed
  • Social Media Analysis → Embedding large datasets
  • Psychographic Segmentation → Demographic embeddings
  • Content Scoring → Pipeline for content analysis

Limitations

  • 📉 Lower MTEB scores (38.6 vs. higher-end transformers)
  • 🔀 Quality trade-off (speed > semantic nuance)
  • 🧪 Still under evaluation (teams testing OpenAI embeddings, Google Vertex AI, etc.)

Combined Understanding

Model2vec balances speed vs. quality.

While not achieving the deepest semantic accuracy, its extreme efficiency makes it ideal for:

  • Large-scale systems (millions of docs)
  • Real-time/near-real-time embedding
  • CPU-only deployments

Static vs. Dynamic Embeddings

Static Embeddings (Model2vec)

  • One fixed vector per word/token
  • Context-independent (e.g., "bank" → same vector for finance or river)
  • Pre-computed once → stored in lookup table
  • O(1) lookup → no heavy computation
  • Memory-efficient

Dynamic/Contextual Embeddings (Transformers)

  • Context-dependent (different vectors depending on usage)
  • Computed on-demand (full forward pass each time)
  • Slower inference
  • GPU/compute-heavy
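A toy sketch of the contrast (the vectors are made up): the static table returns the same "bank" vector regardless of context, while a contextual model would re-compute it per sentence.

```python
import numpy as np

# Toy static lookup table: one fixed vector per token, precomputed offline
vocab = {
    "bank": np.array([0.8, 0.1]),
    "river": np.array([0.1, 0.9]),
    "money": np.array([0.9, 0.2]),
}

def embed_static(sentence: str) -> np.ndarray:
    # O(1) lookup per token + mean pooling; no forward pass, no context
    vectors = [vocab[tok] for tok in sentence.split() if tok in vocab]
    return np.mean(vectors, axis=0)

# "bank" contributes the exact same vector in both sentences; a contextual
# model would instead produce different vectors for the two usages
print(embed_static("money bank"))
print(embed_static("river bank"))
```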

Why Model2vec Uses Static Embeddings

  • ⚡ Speed (lookup vs. transformer run)
  • 📈 Scalability (millions of docs)
  • 🛠 Deployment simplicity (CPU infra only)
  • ⏱ Predictable, consistent latency

How the Static Vector is Chosen (e.g., "bank")

Aggregation Process

  1. Context Sampling → Collect many examples of "bank" (financial, river, verb, etc.)
  2. Vector Averaging/Pooling → Combine contextual embeddings into one centroid vector (sketched after this list)
  3. Methods Used:
    • Mean pooling (most common)
    • Weighted averaging (frequency bias)
    • Subword aggregation (sometimes)
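A hedged sketch of the aggregation step (the contextual vectors below are random stand-ins for real transformer outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for contextual embeddings of the token "bank" extracted from a
# transformer over many sampled sentences (financial, river, verb, ...)
contextual_bank = rng.normal(size=(1000, 768))

# Mean pooling: every sampled context counts equally
static_bank_mean = contextual_bank.mean(axis=0)

# Weighted averaging: if most sampled usages are financial, the centroid
# skews toward the financial sense
weights = rng.dirichlet(np.ones(1000))
static_bank_weighted = weights @ contextual_bank

print(static_bank_mean.shape, static_bank_weighted.shape)  # (768,) (768,)
```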

Effects

  • Captures average meaning across contexts
  • Skews toward most frequent usage (e.g., "financial bank")

Limitations

  • โŒ No disambiguation at inference time
  • โŒ Muddled vectors for polysemous words
  • โŒ Less effective for nuanced semantics

Model2vec's Distillation Approach

  • Start with a large transformer (e.g., BERT)
  • Extract contextual embeddings across many contexts
  • Compress into static lookup tables
  • Trade semantic nuance for massive speed gains
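A rough, illustrative distillation loop using Hugging Face transformers (not model2vec's exact recipe, which also adds steps such as dimensionality reduction and frequency re-weighting):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # illustrative teacher model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

static_table = {}
for word in ["bank", "river", "money"]:          # in practice: the full vocab
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    # Mean-pool over subword positions to get one static vector per word
    static_table[word] = out.mean(dim=1).squeeze(0).numpy()

print(static_table["bank"].shape)  # (768,)
```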