
Bias and Variance

Last updated: 7/7/2025


Supervised machine learning endeavors to learn a mapping function, denoted as f, that accurately maps input data X to an output variable Y. The fundamental goal is not merely to perform well on the data used for training the model, but to generalize effectively, making accurate predictions on new, previously unseen data. The true measure of a model's success lies in its performance on this unseen data, often quantified by its prediction error.
💡
The total error associated with a machine learning model is NOT a single, indivisible quantity.

Prediction errors can be decomposed into two main subcomponents of interest: error from bias and error from variance, plus a third component, the Irreducible Error.

💡
The tradeoff between a model's ability to minimize bias and variance is foundational to training machine learning models, so it's worth taking the time to understand the concept.
💡
Bias and Variance: Two Concepts Easy to Learn — Difficult to Master.

Bias

Bias: It’s the error introduced by approximating a real-world problem (which may be extremely complex) by a much simpler model.

High bias: The model makes strong assumptions and fails to capture the data’s patterns well — this is called underfitting.

Example (High Bias): Using a linear model to fit a clearly non-linear relationship.

  • High training error + High validation error —> High Bias

    (Model is too simple, underfitting)
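A minimal numpy sketch of this signature (synthetic data; `true_f` and all constants are invented for the illustration): fitting a straight line to a sine-shaped relationship leaves both the training and the validation error high.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The true relationship is non-linear; a straight line cannot capture it.
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 50)
y_train = true_f(x_train) + rng.normal(0, 0.1, 50)
x_val = rng.uniform(0, 1, 50)
y_val = true_f(x_val) + rng.normal(0, 0.1, 50)

# Degree-1 (linear) fit: a strong assumption about the data -> underfitting.
coef = np.polyfit(x_train, y_train, deg=1)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
val_mse = np.mean((np.polyval(coef, x_val) - y_val) ** 2)

# Both errors are large and of similar size: the signature of high bias.
```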

Variance

Variance: It’s how much the model’s predictions change when trained on different data.

High variance: The model captures too much noise from the training data — this is called overfitting.

Example (High Variance): A deep decision tree that performs perfectly on training data but fails on test data.

  • Low training error + High validation error —> High Variance

    (model is too complex, overfitting)
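The same signature can be sketched in numpy. The note's example is a deep decision tree; as a stand-in (so the snippet needs only numpy), a high-degree polynomial shows the same behavior on synthetic data: near-zero training error, much larger test error.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

n = 20
x_train = np.sort(rng.uniform(0, 1, n))
y_train = true_f(x_train) + rng.normal(0, 0.3, n)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(0, 0.3, 200)

# A very flexible model (12 coefficients for 20 points) chases the noise.
coef = np.polyfit(x_train, y_train, deg=11)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)

# Tiny training error with a much larger test error: high variance.
```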

Irreducible Error (ϵ)

  • In addition to the model's errors (Bias and Variance) —> there exists a third component known as irreducible error.
  • It’s an error that simply cannot be reduced by any model.
  • Analogy (Archer): Even the best archer in the world can NOT predict a sudden, unpredictable gust of wind that occurs just as the arrow leaves the bow. This unpredictable element is like an irreducible error.
  • In supervised learning, we often assume the real world is messy.

Sources of Irreducible Error inside Data

IMPORTANT: Irreducible error stems not from the model itself, but from inherent properties of the data.

  1. Measurement Errors: The tools and methods are often imperfect. Sensors might have limited precision, readings can fluctuate, or human error —> inaccuracies that deviate from the true underlying values.
  2. Inherent Randomness (Noise): Many real-world phenomena possess an intrinsic element of stochasticity or randomness.
    • For example, predicting the exact outcome of a coin flip or the precise movement of a stock price involves inherent unpredictability that no model can fully capture.
  3. Unobserved Features: There may be factors or features that influence the target variable but are not measured or included in the dataset used for modeling.
    • These hidden features can be the source of the noise in the relationship. We might have oversimplified the data model.

🤌🏼
It’s irreducible because it lies outside the scope of what the modeling process can control.

Improving the learning algorithm, tuning hyperparameters, or adding more data (of the same type) cannot eliminate noise originating from the fundamental nature of the problem or the data collection process itself.

🤌🏼
The irreducible error is the price of randomness — it’s nature’s way of saying: ‘No matter how smart you are, you can’t predict everything.’

It is irreducible because no model, no matter how smart, can predict randomness.

Formula

  • If we have a ground truth y and a true underlying function f —> then y = f(x) + \epsilon
  • We wish to create \hat{f}(x); even if it matches f(x) exactly, we still have \epsilon.
  • The output of \hat{f}(x) is \hat{y}.
  • \epsilon is independent of the new input x and of the training data D.
  • Mathematically: \mathbf{E}[\epsilon] = 0
  • Mathematically: Var(\epsilon) = \mathbf{E}[\epsilon^2] = \sigma^2
    • That’s because Var(\epsilon) = \mathbf{E}[(\epsilon - \mathbf{E}[\epsilon])^2] = \mathbf{E}[(\epsilon - 0)^2] = \mathbf{E}[\epsilon^2] = \sigma^2
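These two facts can be checked numerically (a quick numpy sketch; the choice of sigma = 0.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.5
eps = rng.normal(0.0, sigma, 1_000_000)  # zero-mean Gaussian noise

mean_eps = eps.mean()        # should be close to E[eps] = 0
var_eps = np.mean(eps ** 2)  # should be close to E[eps^2] = sigma^2 = 0.25
```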

🎯 Why Do We Assume the Noise ϵ Has Zero Mean (E[ϵ] = 0)?

  • By definition, we assume it’s pure randomness. —> If noise had a non-zero mean, it would mean there’s a systematic shift — a pattern we could model!
  • Another intuition: Noise = unpredictable deviations up or down, equally likely.
  • Imagine flipping a coin —> The noise is fair (half heads, half tails) —> Over many flips, the noise should cancel out → zero mean. Otherwise, it's not random noise — it's a systematic bias.
  • In standard ML and regression, zero-mean noise is the default because we want to cleanly separate randomness from predictability.

The Bias-Variance Tradeoff

  • The bias-variance tradeoff is one of the most fundamental concepts in supervised machine learning. It describes the inherent inverse relationship between the bias and the variance of a learning algorithm.
  • Generally, actions taken to decrease a model's bias tend to result in an increase in its variance. Conversely, efforts to reduce a model's variance often lead to an increase in its bias.
  • The goal is not to eliminate either bias or variance completely, but rather to find the "sweet spot”.

Target Shooting Analogy

  • Your goal is to shoot arrows at the center of a target (the bullseye = true function f(x))
| Concept | What it Means | Archer Analogy |
| --- | --- | --- |
| Bias | How far your average aim is from the true center. | If you are systematically aiming off-center (e.g., all your arrows land near the top-right of the target), you have high bias. |
| Variance | How much your shots spread out around your average aim. | If your arrows are scattered widely all over, you have high variance. If your arrows are tight together, you have low variance. |
| Irreducible Error | Random unavoidable noise in the system. | Even if you're perfect, some slight wind or hand shake could cause slight randomness. |
| Bias | Variance | What Happens | Description |
| --- | --- | --- | --- |
| Low | Low | Arrows tightly around bullseye | You're accurate and consistent: ideal model! |
| High | Low | Arrows tightly grouped but far from center | You're consistent but wrong, like a bad model that's underfitting |
| Low | High | Arrows centered on bullseye but spread out | You aim right but you're inconsistent, like a model overfitting different data |
| High | High | Arrows scattered and off-center | You're both wrong and inconsistent: worst case |

Curve Fitting Analogy

  • Even though we only discuss Bias-Variance from a regression perspective, keep in mind that the practical implications of the bias-variance tradeoff are applicable to all supervised learning contexts.

Test Analogy

  • A nervous student (high variance) panicking at every test.
  • A confident-but-wrong student (high bias) repeating the same mistakes.

Algorithm Tuning

Many algorithms have parameters that directly control complexity and thus navigate the tradeoff.

  • K-Nearest Neighbors (k-NN): A small k leads to low bias/high variance; increasing k increases bias/decreases variance.
  • Support Vector Machines (SVM): The regularization parameter C controls the tradeoff. High C leads to low bias/high variance; low C leads to high bias/low variance.

KNN Example

  • KNN is explained in a separate note.
  • Remember: bias pays very little attention to the training data and oversimplifies the model.
  • If K=1 and we have a test point —> the model looks at the single closest training point and copies its class, instead of looking at the closest K points and choosing among them.

    We call K=1 overfitting (low Bias, High Variance) because of the following 4 reasons:

    1. Memorizes noise: If a noisily-labeled point is close to a test point, it will copy that label exactly.
    2. Highly sensitive to tiny changes: Adding or removing one point can change the prediction.
    3. Complex decision boundaries: If you color the areas based on the classifications, you will find the boundary flips often between classes: jagged, unstable.
    4. It achieves 0 training error: When classifying a training point, the nearest neighbor is the point itself, so it copies its own (correct) label; with a larger K it would consider itself and others.
  • If K is large (e.g., K=68):
    • Even if a minority class dominates locally, the large majority from far away may override it.
    • The model is stable but can miss patterns (underfit) (High Bias, Low Variance); stable because it almost always predicts the same class, since the majority vote draws on a large share of the training set.
  • If K is intermediate (5 ≤ K ≤ 20):
    • Good balance between locality and smoothing, ✅ Moderate bias, ✅ moderate variance
    • Sweet spot of generalization
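The K sweep above can be reproduced with a tiny hand-rolled k-NN regressor (a numpy sketch on synthetic 1-D data; `knn_predict` and all constants are invented for this example, and regression stands in for classification):

```python
import numpy as np

rng = np.random.default_rng(7)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 100)
y_train = true_f(x_train) + rng.normal(0, 0.3, 100)
x_test = rng.uniform(0, 1, 500)
y_test = true_f(x_test) + rng.normal(0, 0.3, 500)

def knn_predict(x_query, k):
    # Average the labels of the k nearest training points (1-D regression).
    dists = np.abs(x_train[:, None] - x_query[None, :])
    nearest = np.argsort(dists, axis=0)[:k]
    return y_train[nearest].mean(axis=0)

train_err_k1 = np.mean((knn_predict(x_train, 1) - y_train) ** 2)  # exactly 0: each point picks itself
test_err_k1 = np.mean((knn_predict(x_test, 1) - y_test) ** 2)     # K=1: high variance
test_err_k10 = np.mean((knn_predict(x_test, 10) - y_test) ** 2)   # intermediate K: sweet spot
test_err_k90 = np.mean((knn_predict(x_test, 90) - y_test) ** 2)   # large K: high bias (oversmoothed)
```

With this setup the intermediate K beats both extremes on test error, while K=1 still scores a perfect 0 on training error.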

Bias Detailed

In Statistics

Bias(\hat{\theta}) = \mathbf{E}[\hat{\theta}] - \theta
  • Remember in statistics: A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number describing a sample (e.g., sample mean).
    • Statistics (sample mean, sample variance, etc.) are estimators of the parameters.
    • We want to ensure they accurately represent the population, so we find and derive Unbiased Estimators.
    • A statistic \hat{\theta} is an Unbiased Estimator for a parameter \theta if \mathbf{E}[\hat{\theta}] = \theta —> which means the expected value of the estimator is exactly the population parameter.
    • In a nutshell: if we had many sample datasets (each of size n), the average (expected value) of their statistic would equal the population parameter.
    • More on deriving Unbiased Estimators is covered in a separate note.
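The classic concrete case is sample variance: dividing by n gives a biased estimator, dividing by n-1 an unbiased one. A Monte Carlo check (numpy sketch; the population sigma = 2 and n = 10 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10  # sample size; true population variance is sigma^2 = 4

# Draw many independent samples of size n and compute both estimators on each.
samples = rng.normal(0.0, 2.0, size=(200_000, n))
biased = samples.var(axis=1, ddof=0)    # divide by n
unbiased = samples.var(axis=1, ddof=1)  # divide by n - 1

mean_biased = biased.mean()      # ~ (n-1)/n * sigma^2 = 3.6  (systematically low)
mean_unbiased = unbiased.mean()  # ~ sigma^2 = 4.0            (E[estimator] = parameter)
```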

In ML (Supervised Models) (Theoretical Definition)

Bias(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)] - f(x)

We still have the same Bias definition: it’s the difference between the average prediction of a model and the true value.

  1. Same as we studied in irreducible error: our goal is to learn the true function f(x) by creating our own model \hat{f}(x), because f(x) is usually unknown.
    • Remember that the true value is y = f(x) + \epsilon.
  2. We use a specific training dataset D to train a model, which results in a learned function \hat{f}(x; D), often just written as \hat{f}(x).
    • Our learned function depends entirely on the specific random training dataset D.
    • With another D, we'd get a slightly different \hat{f}(x).
  3. Now, let's focus on a single new input point x. We evaluate it using our model \hat{f}(x), which was trained on a specific dataset D.
    • IMPORTANT, CRUCIAL: Since the model depends on the random dataset D, the prediction \hat{f}(x) at this specific point x is also a random variable (because we could retrain the model on many different possible datasets D).
  4. Imagine we computed \hat{f}(x) after training different versions of the model on all possible training datasets D of the same size.
    • We call the average of these predictions \mathbb{E}_D[\hat{f}(x)], or simply \mathbb{E}[\hat{f}(x)].
  5. Finally, the Point-wise Bias: the bias of a machine learning model \hat{f} at the specific point x is Bias(x) = \mathbb{E}[\hat{f}(x)] - f(x).
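The "retrain on many datasets" thought experiment can be simulated directly (a numpy sketch with a made-up true function x^2 and a deliberately too-simple linear model; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def true_f(x):
    return x ** 2  # the (normally unknown) true function

x0 = 0.9   # the fixed query point x
preds = []
for _ in range(2000):
    # Each iteration plays the role of one possible training dataset D.
    x = rng.uniform(-1, 1, 30)
    y = true_f(x) + rng.normal(0, 0.1, 30)
    coef = np.polyfit(x, y, deg=1)      # too-simple (linear) model
    preds.append(np.polyval(coef, x0))  # f_hat(x0; D)

preds = np.array(preds)
bias_at_x0 = preds.mean() - true_f(x0)  # ~ E[f_hat(x0)] - f(x0)
```

Because a line fit to a symmetric parabola averages out to roughly the constant 1/3, the bias at x0 = 0.9 is strongly negative (about 1/3 - 0.81): a systematic miss that no amount of retraining removes.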
ℹ️
  • Bias here measures: On average, how wrong is your model at a point x?
  • Bias represents a systematic error inherent in the model. The model's tendency to consistently miss the true value, regardless of the particular training data used.
  • This systematic deviation arises from the model's inherent limitations or the assumptions it makes about the data.
🤌🏻
Where this "many training datasets" idea comes from?

When we talk about bias and variance formally in machine learning, we imagine (theoretically) that: There is a huge population of possible training datasets you could sample from the real world.

IMPORTANT: Theoretical bias and variance exist for mathematical understanding, but practical ML uses validation sets, cross-validation, regularization, etc., to control bias/variance WITHOUT directly computing them.

📖 Why can't we directly compute the expectation? Why the Bias Calculation is only theoretical in ML?

It is important to distinguish between the theoretical definition of bias and variance and how it is diagnosed in practice. The formal definition involves calculating the expected value over an infinite number of hypothetical training sets drawn from the true data distribution. This theoretical quantity measures the inherent instability of the learning algorithm given the data distribution and sample size.

  • We'd need to retrain the model on all possible datasets — an infinite process.
  • We'd need access to the true function f(x), which is impossible in real life.
  • We'd need to sample infinite noise patterns ϵ, again impossible.
🔥
In practical ML workflows, practitioners typically work with only one or a limited number of training datasets. It is impossible to directly compute the expectation over all possible datasets.

Solution: variance and bias are diagnosed indirectly by observing the model's performance.

We discuss this in a separate note.

⚠️ Important Note: Unbiased ≠ Best

  • An estimator (statistic) \hat{\theta} is unbiased for a parameter \theta if: \mathbf{E}[\hat{\theta}] = \theta
  • An unbiased estimator might still be bad if its variance is huge; that is, the estimates are all over the place.
  • Key to Understand: like a dart thrower whose arrows are wildly scattered (High Variance) but on average centered (Unbiased).
    • He might miss the bullseye each time but the average landing spot is exactly at the center.
  • We will use MSE to evaluate both bias and variance.
Rule of Thumb:

❗ Unbiased does not mean optimal. ✅ Optimal = lowest MSE, even if biased.


Variance Detailed

Var(\hat{\theta}) = \mathbf{E}[(\hat{\theta} - \mathbf{E}[\hat{\theta}])^2]
Var(\hat{f}(x)) = \mathbf{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]
  • This is just the formal writing of variance (squared standard deviation) —> \sigma^2 = \sum \frac{(x_i - \bar{X})^2}{n}
  • Formally: The variance of a model's prediction f^(x)\hat{f}(x) at point xx is the expected squared difference between the prediction made by a model trained on a specific dataset D and the average prediction over all possible datasets D.
    • Same multiple training dataset definitions as we did in Bias
  • Conceptually: High variance is synonymous with high model sensitivity to the training data. It is sensitivity to fluctuations, noise within the training data ; small changes in the training set can lead to significantly different learned models and predictions.
  • An overfit model fails to generalize because it has learned patterns that do not exist in the broader data distribution, mistaking noise for signal.
  • A significant gap between low training error and high validation error is interpreted as a strong indicator of high variance.

Mean Squared Error (MSE)

  • MSE is a criterion that tries to take into account both concerns (Bias) and (Variance).
  • MSE is a way to measure the goodness of the estimator.
  • MSE is the average of squared errors.

Scenario: Choosing Between Two Estimators 🎭

  • Imagine we have two models, which one would you prefer? Isn't "unbiased" always better?
    | Estimator | Bias | Variance |
    | --- | --- | --- |
    | Estimator A | ✅ Small bias | ✅ Small variance |
    | Estimator B | ✅ Zero bias (Unbiased) | ❌ Very high variance |
  • Estimator A: Slightly biased but very stable.
  • Estimator B: Unbiased but very unstable.
  • Actually we would prefer Estimator A! Because small bias + low variance → often better than zero bias + huge variance.
  • To make it easier, we actually need to minimize MSE!
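This preference can be demonstrated with a toy pair of estimators of a population mean (a numpy Monte Carlo sketch; the shrinkage factor 0.8 and constants mu = 10, sigma = 10, n = 5 are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma, n = 10.0, 10.0, 5

samples = rng.normal(mu, sigma, size=(400_000, n))
xbar = samples.mean(axis=1)

est_B = xbar        # Estimator B: unbiased, variance sigma^2 / n = 20
est_A = 0.8 * xbar  # Estimator A: shrunk toward 0, biased but lower variance

mse_B = np.mean((est_B - mu) ** 2)  # ~ 20
mse_A = np.mean((est_A - mu) ** 2)  # ~ bias^2 + var = (0.2*10)^2 + 0.64*20 = 16.8
```

The slightly biased Estimator A wins on MSE, exactly the point of the scenario.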

Types of MSE

IMPORTANT: MSE is the Expected Squared Error — but Expected over what exactly?

In truth, there are several distinct flavors of theoretical MSE. Yet, in much of the machine learning literature, authors often refer to "MSE" without clearly specifying which version they mean.

  1. First Type of MSE: Fixed Single Test Input x (Training Randomness Only) → We evaluate the performance of a model at a specific test point x, by imagining many possible training datasets.
    • Here, the randomness comes only from the training dataset.
    • We ask: "If we trained on different datasets, how would the model behave at this exact x?"
  2. Second Type of MSE: Fixed Single Trained Model (Input Randomness Only) We evaluate the performance of a single trained model, by measuring its error across different input points drawn from the data distribution.
    🔥
    This version is often used in practical machine learning, where we assume the model is fixed and test performance varies with the input data.
  3. Third Type of MSE: Full Expectation (Training and Input Randomness) → We evaluate a model’s performance over both sources of randomness
    • This involves a double expectation: first over training datasets, then over input points.

The good news is that, regardless of which MSE flavor you are faced with, they all admit a decomposition into bias-squared, variance, and noise.

Derivation (Using Training Dataset Randomness)

  1. MSE = \mathbf{E}[(y - \hat{f}(x))^2]
  2. We have y = f(x) + \epsilon, so MSE = \mathbf{E}[(f(x) + \epsilon - \hat{f}(x))^2]
  3. Rearrange the terms inside the parentheses: \mathbf{E}[((f(x) - \hat{f}(x)) + \epsilon)^2]
  4. Expand the square (a+b)^2 = a^2 + 2ab + b^2: \mathbf{E}[(f(x) - \hat{f}(x))^2 + 2\epsilon(f(x) - \hat{f}(x)) + \epsilon^2]
  5. Use the linearity of expectation: \mathbf{E}[(f(x) - \hat{f}(x))^2] + \mathbf{E}[2\epsilon(f(x) - \hat{f}(x))] + \mathbf{E}[\epsilon^2]
    1. Since \mathbf{E}[\epsilon] = 0, we have \mathbf{E}[\epsilon^2] = Var(\epsilon) = \sigma^2
    2. For the middle term: \mathbf{E}[2\epsilon(f(x) - \hat{f}(x))] = 2\mathbf{E}[\epsilon]\mathbf{E}[f(x) - \hat{f}(x)], because the noise of the point (x, y) is independent of the training data D and of the function f(x). Since \mathbf{E}[\epsilon] = 0, the whole term is ZERO.
  6. We are still left with the piece \mathbf{E}[(f(x) - \hat{f}(x))^2]
    1. We now use (a-b)^2 = a^2 - 2ab + b^2 —> \mathbf{E}[f(x)^2 - 2 f(x)\hat{f}(x) + \hat{f}(x)^2]
    2. Use the linearity of expectation: \mathbf{E}[f(x)^2] - 2\mathbf{E}[f(x)\hat{f}(x)] + \mathbf{E}[\hat{f}(x)^2]
    3. We know that f(x) is a fixed value (the true function at x), so —> f(x)^2 - 2 f(x)\mathbf{E}[\hat{f}(x)] + \mathbf{E}[\hat{f}(x)^2]
    4. Add and subtract \mathbf{E}[\hat{f}(x)]^2 —> f(x)^2 - 2 f(x)\mathbf{E}[\hat{f}(x)] + \mathbf{E}[\hat{f}(x)]^2 - \mathbf{E}[\hat{f}(x)]^2 + \mathbf{E}[\hat{f}(x)^2]
    5. Now the trick is that we can form Bias^2 and Variance from step (4):
      1. f(x)^2 - 2 f(x)\mathbf{E}[\hat{f}(x)] + \mathbf{E}[\hat{f}(x)]^2 is just (f(x) - \mathbf{E}[\hat{f}(x)])^2, which is Bias(\hat{f}(x))^2
      2. We are left with \mathbf{E}[\hat{f}(x)^2] - \mathbf{E}[\hat{f}(x)]^2, which is the definition of Var(\hat{f}(x))
  7. All in all, combining steps 5 and 6: MSE = Var(\hat{f}(x)) + Bias(\hat{f}(x))^2 + \sigma^2
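The decomposition can be verified numerically by simulating the "many training datasets" expectation (a numpy sketch; the true function, the degree-3 model, and all constants are invented for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3

def true_f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.3  # fixed query point
preds = []
for _ in range(5000):
    # One possible training dataset D per iteration.
    x = rng.uniform(0, 1, 40)
    y = true_f(x) + rng.normal(0, sigma, 40)
    coef = np.polyfit(x, y, deg=3)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()

# Fresh noisy targets y = f(x0) + eps, one per trained model.
y0 = true_f(x0) + rng.normal(0, sigma, 5000)
mse = np.mean((y0 - preds) ** 2)

decomposed = bias_sq + variance + sigma ** 2
# mse and decomposed agree up to Monte Carlo error.
```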

Why the irreducible error is the lower bound on the expected prediction error?

  • Non-Negative Components (MSE):
    • Bias^2 is always ≥ 0 because it's a squared value.
    • Variance Var(\hat{f}(x)) is always ≥ 0 by definition.
    • The irreducible error has variance \sigma^2 = Var(\epsilon), which is also always ≥ 0. In most real-world problems there is some noise, so \sigma^2 > 0.
  • The Lower Limit: Even if Bias and Variance were zero, the expected minimum error (MSE) would satisfy: \mathbf{E}[(y - \hat{f}(x))^2] ≥ 0 + 0 + \sigma^2

No model, no matter how sophisticated or perfectly trained, can achieve an expected squared error lower than \sigma^2 (at point x, averaged over datasets and noise).

Modern Machine Learning Twist 🧠 (Double Descent Phenomenon) (The end of the bias-variance trade-off?)

🔥
✅ In modern ML, especially deep learning, things are a little surprising:
  • Very large models (millions or billions of parameters) can still generalize well if trained carefully.
  • This goes against the "classical" view where too much complexity must cause overfitting.
  • The term Double Descent was coined in Dec 2018 in Reconciling Modern ML Practice and the Bias–Variance Trade-off, then in Dec 2019 OpenAI published a paper and article Deep Double Descent that shows that effect in CNNs, ResNets, transformers and epoch-wise training.
    • Since then Dozens of papers (e.g. “On the Role of Optimization in Double Descent”, “Double Descent Demystified”) extend the phenomenon to linear regression, random forests, kernel methods and analyze implicit regularization of SGD.
  • The Second Descent reveals behavior of (extreme overparameterization) that is not typically analyzed classically. We observe strong test performance from very overfit, complex models.
  • It’s counterintuitive because you would expect DL training to be monotonic.

[1] The Classical Regime (Under-parameterized: P < N) (U-Curve): Underfitting — Sweet Spot — Overfitting

  • Classical Tradeoff: increasing model complexity (e.g., adding parameters) decreases bias (the model becomes more flexible and can approximate the true function better) but increases variance (the model becomes more sensitive to the specific training points, including noise).
  • Classical Tradeoff leads to the classical U-shaped curve for test error vs model complexity.
  • The Blue Line (Test Error):
    1. Underfitting Region: Initially, increasing complexity improves test performance (lower bias).
    2. Sweet Spot (Optimal Balance): There's a balance where test error is minimized.
    3. Overfitting Region: Keep increasing complexity and test error starts to climb. (This is more cumbersome than underfitting)

  • The Red Line (Train Error):
    • It keeps decreasing until reaching the interpolation threshold.
    • Interpolation Threshold: the point where training error first becomes exactly zero. (High Variance)
    • Interpolation Threshold: means the model is so wiggly that it has perfectly fit and memorized the entire training set, passing through each datapoint.
    • Interpolation Threshold: mainly concerns the training error being ZERO; at this point the Test Error can be at its peak.
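The classical U-curve portion can be sketched by sweeping polynomial degree on synthetic data (numpy only; the degrees 0 / 3 / 17 are arbitrary picks for "too simple", "balanced", and "near interpolation"):

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_f(x_train) + rng.normal(0, 0.3, 20)
x_test = rng.uniform(0, 1, 300)
y_test = true_f(x_test) + rng.normal(0, 0.3, 300)

train_err, test_err = {}, {}
for deg in [0, 3, 17]:  # underfit / sweet spot / near the interpolation threshold
    coef = np.polyfit(x_train, y_train, deg)
    train_err[deg] = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_err[deg] = np.mean((np.polyval(coef, x_test) - y_test) ** 2)

# Train error falls monotonically with complexity; test error is U-shaped.
```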
https://medium.com/@fernando.dijkinga/double-descent-the-surprising-phenomenon-challenging-deep-learning-theory-679a8440d37f

[2] What Happens at the Interpolation Threshold? (P ≈ N)

  • Interpolation here literally means that our model has interpolated between every single training point; it's drawing a curve that runs through every single training point.
  • P = Number of Parameters, N = Number of Training Samples
At the interpolation threshold, there can often be a single choice of model that works, and there is no reason to believe that model will be good for production.
  • At this threshold we find P = N. But why does that make the training error zero? And why is there only one way to interpolate the data?
  • Let’s use polynomial regression as an example:
    • You have 5 training datapoints (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5).
    • You have only 1 input column / feature.
    • To have P = N, we need 5 coefficients in the model —> which means a polynomial function of degree 4.
    • Our model is y = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4.
    • So it's like we have 5 equations (one per sample) with 5 unknown coefficients (a_0, a_1, a_2, a_3, a_4).
    • So (in general, unless the matrix is badly degenerate) there is one unique solution.
      • Degenerate matrices and matrix inverses are discussed in a separate note.
  • This is why at the interpolation threshold we have only ONE SINGLE SOLUTION. That unique solution can be weird, wiggly, and overfit.
    • In the next phase (the Over-parameterized Regime), we will have multiple solutions to choose from.
  • What if we have two features? How can we do polynomial regression up to P = N?
    • It's simple —> f(x_1, x_2) = y = a_0 + a_1x_1 + a_2x_2 + a_3x_1^2 + a_4x_1x_2 + a_5x_2^2 + \cdots
    • We expand to many polynomial terms (combined features) and fit the coefficients over those expanded features.
    • Terms are powers and cross-products of x_1 and x_2.
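The P = N case from the example above can be checked directly: 5 points, 5 coefficients, one interpolant with zero training residuals (a numpy sketch; the sample points approximate sin(2πx) but any 5 points with distinct x would do):

```python
import numpy as np

# 5 training points and 5 polynomial coefficients (degree 4): P = N.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([0.0, 1.0, 0.0, -1.0, 0.0])  # samples of sin(2*pi*x)

# The 5x5 Vandermonde system has exactly one solution: the unique interpolant.
coef = np.polyfit(x, y, deg=4)
train_residuals = np.polyval(coef, x) - y  # zero at every training point
```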

[3] The Modern Interpolating Regime (Over-parameterized: P ≫ N)

  • After the interpolation threshold, as the model complexity P increases —> the curve does not continue to rise indefinitely; instead a second descent happens in the Test Error —> we get the Double Descent Curve!
  • It’s the characteristic of many successful modern deep learning models.
  • The Double Descent Phenomenon is not confined to a specific niche model or dataset.
    • It has been proved to apply in linear regression, ridge regression, Neural Networks, Convolutional Neural Networks (CNNs), Residual Networks (ResNets), Transformers, Decision Trees and Ensemble Methods.
    • It has been tested on various datasets, including standard benchmarks like CIFAR-10 and CIFAR-100 (often with added label noise to amplify the effect) , ImageNet , MNIST , as well as synthetic datasets (e.g., Gaussian data for theoretical analysis) and other real-world regression and classification datasets

The question is why does it work when P ≫ N?

  • Empirically this is well proven to happen, but theoretically why it happens is still an active area of research 😮
  • Existing explanations often rely on simplified models (like linear regression or random features).

A quick simple analogy (IMPORTANT): You have 10 points to fit and exactly 10 sticks (parameters): there is essentially one way to thread the sticks through every point. With far more sticks than points, there are many ways to fit all the points, and some of those arrangements are much smoother.

Implicit Regularization and Optimization Bias

  • This is the most central theme in explaining the second descent.
  1. After the interpolation threshold, every model onwards passes through each training data point.
  2. Now there are infinitely many models that can fit the training data perfectly.
  3. Some of these solutions are ugly, but some are beautiful:
    • The only thing that changes is how the model connects the in-between points.
    • As the models become more and more complex, these connections can become smoother, and the resulting prediction may fit your test data better.
🔥
Optimizers like SGD "choose" the nicer solutions (smooth, low-norm) naturally from the infinite solutions. But HOW? That is exactly what the research on implicit regularization studies.
  • In general, SGD leans towards solutions with minimum L2 norm (simpler weights) and zero residual, flatter minima (a less sharp loss surface), and smoother functions.
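The "many interpolating solutions, pick the low-norm one" idea can be made concrete with plain linear algebra (a numpy sketch on a random under-determined system; note this uses `np.linalg.lstsq`'s minimum-norm solution as a stand-in for what SGD does implicitly in deep networks):

```python
import numpy as np

rng = np.random.default_rng(9)

# Under-determined linear system: N = 5 samples, P = 20 parameters (P >> N).
N, P = 5, 20
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

# Infinitely many w satisfy X @ w = y; lstsq returns the minimum-L2-norm one.
w_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Build another interpolating solution by adding a null-space direction:
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                  # X @ null_vec ~ 0
w_other = w_min + 0.5 * null_vec   # still fits the training data exactly

# w_other interpolates just as well but has a strictly larger norm.
```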
🔥
Two Core Principles:

1. At the interpolation threshold, there can often be a single choice of model that works.

2. In the limit of infinitely large models, there will be a vast number of interpolating models, and we can pick the best amongst them.

Factors Influencing Double Descent

The precise shape, peak location, and height of the double descent curve are not fixed; they are influenced by the following factors: the model, the data, the training process, regularization, sparsity, and noise.

[1] Model Architecture and Size (PP) Model-wise Double Descent

  • This is the most frequently studied form —> varying the size of the model (often the network width) while keeping the dataset and training procedure fixed.
  • Increasing PP is a necessity to move us to the Over-parameterized region.
  • Increasing network width (Number of Neurons in Layer) often leads to the canonical double descent behavior.
  • In contrast, increasing network depth (Number of Layers) beyond a certain point, while also increasing PP, has been observed in some studies (e.g., with ResNets) to worsen test performance monotonically.
  • Note: Architectural Choices induce different implicit biases or optimization landscapes, not simply increasing PP.

[2] Dataset Size (NN) Sample-wise Non-monotonicity

  • Describes a regime where more samples hurt the performance of the model.
  • Increasing N shifts the double descent peak towards larger model sizes P.
  • Interesting: If PP is fixed, increasing N can push the model away from the second descent towards the interpolation threshold —> This highlights the need to co-evolve model size and dataset size.

[3] Training Duration (Epochs) Epoch-wise Double Descent

  • There is a regime where training longer reverses overfitting.
  • Continue training even if overfitting (peak error), and potentially a second descent.
  • Don’t do Early Stopping on Test / Validation dataset — It can prevent the double descent curve, making you stop near the first minimum.
  • The existence of epoch-wise double descent complicates the standard practice of early stopping based solely on a validation-error increase.
    https://arxiv.org/pdf/1912.02292

[4] Regularization: Regularization-wise Double Descent

  • Regularization = Any method that encourages the model to avoid overfitting and generalize better.
  • Explicit Regularization: You add it manually (in loss or architecture)
    • L2 weight decay (Ridge), Dropout, Early stopping, Data augmentation
  • Implicit Regularization / Implicit Bias: It happens naturally from training dynamics
    • The optimizer Choice (e.g., SGD, Adam, RMSprop), Initial weights, Batch Normalization, Residual Connections
  • Explicit Regularization is a factor because it can flatten or remove the harmful test-error peak at the interpolation threshold, making the transition from under- to over-parameterized smoother.
  • Implicit Regularization is a factor because this implicit bias guides the selection among the many interpolating solutions in the P > N regime, favoring smooth curves and low-norm weights.

[5] Sparsity:

  • Sparsity means: many parameters (weights) are zero —> You "prune" (remove) neurons, weights, or filters from a network after training (or even during training).
  • Intuition says: fewer parameters = lower capacity → less overfitting, more generalization.
  • In Double Descent discussions, sparsity is less commonly mentioned, but surprisingly it follows a different pattern: Worsen —> Improve —> Worsen
  • When you gradually increase sparsity in a trained model:
| Stage | Behavior | Why |
| --- | --- | --- |
| Low sparsity (small pruning) | Test performance worsens first | Pruning deletes useful redundancies and disrupts the network’s fine-tuned balance; even small amounts can break important structures. |
| Moderate sparsity | Test performance improves | Sparsity now acts like strong regularization, suppressing overfitting and letting true patterns dominate. |
| Extreme sparsity | Test performance worsens again | The model is underfitting; it can no longer represent the true signal. |
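The pruning step itself is simple; a common variant is magnitude pruning, sketched below (the function name and the toy weight vector are my own, for illustration): zero out the smallest-magnitude fraction of weights.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    pruned = weights.copy()
    k = int(sparsity * pruned.size)
    if k > 0:
        drop = np.argsort(np.abs(pruned.ravel()))[:k]
        pruned.ravel()[drop] = 0.0
    return pruned

w = np.array([0.05, -2.0, 0.7, -0.01, 1.5, 0.3])
for s in (0.0, 0.33, 0.66):
    print(f"sparsity={s}: {magnitude_prune(w, s)}")
```

Sweeping `sparsity` from 0 toward 1 on a trained model and measuring test error is how the worsen → improve → worsen curve above would be traced.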

[6] Data Noise Levels

  • Label noise = some training examples have the wrong labels. (For example, an image of a "cat" is mistakenly labeled "dog.")
  • The model doesn’t know which labels are wrong — it tries to fit everything.
  • With noisy labels, to achieve perfect fit (zero training error), the model is forced to contort itself — bend unnaturally — to also fit the wrong labels.
  • Even without artificial noise, real-world complex datasets (like CIFAR-10, ImageNet) naturally have label noise or label ambiguity. Thus, even clean-looking datasets can exhibit double descent — but adding noise makes the effect much more dramatic and easier to observe.
  • Label noise in the training data amplifies the peak of the double descent curve: near the interpolation threshold the model has memorized wrong labels, so it predicts worse on test data.
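Injecting controlled label noise, as the papers on double descent do in their experiments, can be sketched like this (a small helper of my own; the function name and fractions are illustrative):

```python
import numpy as np

def add_label_noise(labels, noise_frac, n_classes, rng):
    """Flip a random fraction of labels to a uniformly chosen *wrong* class."""
    noisy = labels.copy()
    n_flip = int(noise_frac * len(labels))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        wrong = [c for c in range(n_classes) if c != labels[i]]
        noisy[i] = rng.choice(wrong)
    return noisy

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)          # clean labels, 10 classes
y_noisy = add_label_noise(y, 0.2, 10, rng)  # 20% label noise
print("fraction flipped:", np.mean(y_noisy != y))
```

Because each flip goes to a wrong class, exactly `noise_frac` of the labels end up incorrect, which makes the peak amplification easy to study at different noise levels.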
🔥
Finally, if there is one thing both classical statisticians and deep learning practitioners agree on, it is that “more data is always better”.

Factors Influencing Double Descent (DD) Characteristics

| Factor | Typical Effect on DD Curve | Notes |
| --- | --- | --- |
| Model Size (Width ↑) | Increasing model complexity, especially width (number of nodes per layer), moves along the x-axis, so we observe the full DD curve. | |
| Model Size (Depth ↑) | Same as width, but may degrade performance beyond a point, so the second descent is not guaranteed. | Behavior is architecture-dependent (e.g., ResNets often handle depth better). |
| Dataset Size (N ↑) | Shifts the peak to higher P, because more parameters are needed to reach interpolation. | Changes the location of the interpolation threshold (P ≈ N) and can temporarily increase error if P is fixed near the critical regime. |
| Training Epochs ↑ | Traverses the epoch-wise DD curve; reverses overfitting in large models. | |
| Explicit Regularization ↑ (e.g., L2, Dropout) | Reduces or eliminates the peak height; smooths the curve. | These methods exist precisely to prevent overfitting, so ideally there is no interpolation peak at all, only a monotonic smooth curve. |
| Implicit Optimization (Optimizer Choice) | Alters curve shape via different implicit biases. | Different optimizers (e.g., SGD vs. Adam) favor different solution regions; impacts generalization and DD visibility. |
| Sparsity ↑ | Produces a non-monotonic “sparse DD”: error ↑ then ↓ then ↑ again. | Initially hurts (damages structure), then helps (regularization), finally harms again (underfitting). |
| Label Noise ↑ | Amplifies the interpolation peak; worsens test error near the threshold. | Forces models near P ≈ N to fit incorrect labels, yielding unstable, poor solutions; makes DD more pronounced. |

🔥 How we indirectly detect (diagnose) Bias and Variance in practice

  1. Training vs Validation Learning Curves: Plotting the model's performance (e.g., loss or accuracy) vs. training time (epochs) or dataset size.
    • High Bias:
      • Both curves plateau at a high error level. They are close together, but the error is unacceptably high; increasing training time or data doesn’t help if the model is too simple.
    • High Variance:
      • There's a significant gap between the two curves.
      • One way to decrease variance is to add more training data. However, if the overall dataset is fixed, there is a price to pay: less data remains for validation and testing.
    • Good Fit: Both converge at a low error level with only a small gap between them.
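The three cases above can be captured in a crude heuristic (entirely my own; the thresholds are illustrative and problem-dependent, not a standard recipe):

```python
def diagnose(train_error, val_error, acceptable=0.10, gap_tol=0.05):
    """Crude reading of a learning curve's final train/validation errors.
    `acceptable` and `gap_tol` are illustrative thresholds, not standards."""
    if train_error > acceptable:
        return "high bias (underfitting): both errors high and close"
    if val_error - train_error > gap_tol:
        return "high variance (overfitting): large train/val gap"
    return "good fit: low errors, small gap"

print(diagnose(0.30, 0.32))  # both high -> high bias
print(diagnose(0.02, 0.20))  # big gap  -> high variance
print(diagnose(0.03, 0.05))  # low + close -> good fit
```

In practice you would read the full curves rather than two numbers, but the decision logic is the same.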
https://analystprep.com/study-notes/cfa-level-2/quantitative-method/overfitting-methods-addressing/
Resubstitution Error
  • Error obtained from Training Data.
  • You train and test on the same data to get the best-case scenario; it’s an optimistic measure.
  • If Low Resubstitution Error + High Test Error —> Overfitting (High Variance)
  • If High Resubstitution Error + High Test Error —> Underfitting (High Bias)
  • Resubstitution error mainly reflects bias (a high value signals underfitting) and is an optimistic lower bound on the true error —> It tells you little about how predictions will fluctuate on new data; variance is invisible here.
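A small numpy demonstration of how optimistic resubstitution error can be (my own toy setup): a 1-nearest-neighbour classifier on purely random labels. It memorizes the training set perfectly, so resubstitution error is zero, while test error sits near chance.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_predict(X_train, y_train, X):
    """1-nearest-neighbour classifier: a pure memorizer of the training set."""
    dists = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[np.argmin(dists, axis=1)]

X_train = rng.normal(size=(100, 2))
y_train = rng.integers(0, 2, size=100)  # random labels: nothing to learn
X_test = rng.normal(size=(100, 2))
y_test = rng.integers(0, 2, size=100)

resub_error = np.mean(one_nn_predict(X_train, y_train, X_train) != y_train)
test_error = np.mean(one_nn_predict(X_train, y_train, X_test) != y_test)
print("resubstitution error:", resub_error)  # 0.0 — wildly optimistic
print("test error:", test_error)             # near chance
```

Low resubstitution error plus high test error is exactly the overfitting (high variance) signature from the bullet above.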
  2. Model Complexity:
    • If you use a very simple model (linear regression on a complex problem, a shallow neural network, a low-degree polynomial) and see poor performance everywhere, high bias is the likely suspect.
    • If you use a very complex model (a very deep network, a high-degree polynomial, a decision tree with no depth limit) without sufficient regularization, and see a large gap between training and validation performance, suspect high variance.
  3. Cross-Validation Results:
    • If performance is consistently poor across all folds of cross-validation, it points towards high bias.
    • If performance varies drastically between different folds (very good on some, poor on others), it suggests sensitivity to training data subset, indicating high variance.
    • ✅ Always evaluate across multiple seeds or folds.
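The fold-consistency check can be summarized in two numbers (a sketch of my own; `cv_spread` is a hypothetical helper): the mean across folds hints at bias, the spread hints at variance.

```python
import numpy as np

def cv_spread(fold_scores):
    """Mean hints at bias (low mean = consistently poor fit);
    std hints at variance (large spread = sensitivity to the training subset)."""
    scores = np.asarray(fold_scores, dtype=float)
    return {"mean": float(scores.mean()), "std": float(scores.std())}

print(cv_spread([0.61, 0.60, 0.62, 0.59]))  # low mean, tiny std -> high bias
print(cv_spread([0.95, 0.55, 0.90, 0.58]))  # large std -> high variance
```

What counts as a "large" std is problem-dependent; the comparison between models on the same folds is what matters.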

🔥 Managing Bias and Variance: Techniques and Strategies

Techniques to Manage Bias

  • Use More Complex Models: Switch to a More Powerful Model or Increase Complexity within the Model
  • Decrease Regularization:
    • Regularization methods (like L1, L2, dropout) are primarily used to combat overfitting by penalizing model complexity. If a model is underfitting, it might be because the regularization is too strong.
    • It gives the model more freedom to fit the data. Removing regularization entirely might also be considered.
  • Train Longer or Increase Training Data (Use with Caution)
    • For iterative algorithms like neural networks trained with gradient descent, underfitting might occur if the training process is stopped too early before the model has converged.
    • Learning curves are essential to look at.
  • Feature Engineering / Embeddings: useful when the input features do not contain enough information to predict the target accurately
    • Replace Static word2vec with Context-Aware BERT/RoBERTa/LLM embeddings
    • Train embeddings jointly with task: Don’t freeze pre-trained embeddings; let them adapt to your task
    • Collect or derive additional relevant features that capture more aspects of the problem. Domain expertise is often crucial here.
    • Create new features from existing ones, such as interaction terms (products of features)
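The interaction-terms idea from the last bullet can be sketched as a small helper (my own illustration; `add_interactions` is not a library function) that appends all pairwise feature products as new columns:

```python
import numpy as np

def add_interactions(X):
    """Append all pairwise feature products as new columns."""
    n, d = X.shape
    products = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    if not products:
        return X
    return np.hstack([X, np.stack(products, axis=1)])

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(add_interactions(X))
# columns: x1, x2, x3, x1*x2, x1*x3, x2*x3
```

Richer features reduce bias by giving the model information it couldn’t otherwise express, at the cost of more parameters (and potentially more variance).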
  • Boosting: Trains models sequentially, with each new model focusing on correcting the errors made by the previous ones.
    • Examples include AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM.
    • Each new model tries to fix the errors (biases) of the previous one.
    • While primarily aimed at reducing bias by combining weak learners, boosting can increase variance, especially if the base learners are too deep, too many rounds are used, or it keeps fitting harder and harder examples.
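A minimal squared-loss boosting sketch (my own toy implementation with regression stumps, not any particular library’s algorithm) shows the core loop: each round fits a weak learner to the current residuals, which drives the training error (the bias) down round by round.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, residual):
    """Least-squares regression stump: best single-threshold split."""
    best = None
    for t in x:
        left, right = residual[x <= t], residual[x > t]
        if len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        err = np.mean((residual - pred) ** 2)
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return t, lo, hi

def stump_predict(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Squared-loss boosting: each round fits a stump to the residuals."""
    pred = np.zeros_like(y)
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred = pred + lr * stump_predict(stump, x)
        ensemble.append(stump)
    return ensemble, pred

x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x)
_, train_pred = gradient_boost(x, y)
print("initial MSE:", np.mean(y ** 2))
print("boosted train MSE:", np.mean((y - train_pred) ** 2))
```

With squared loss, each round provably reduces the training error (for learning rates in (0, 2)), which is why boosting attacks bias first; pushing `n_rounds` too high is where the variance risk comes in.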

Techniques to Manage Variance

  • Feature Selection / Dimensionality Reduction:
    • Overfitting can occur if the model uses too many features, especially irrelevant or noisy ones.
    • Feature Selection: Identify and remove features that have little predictive power or are redundant. Techniques like using L1 regularization or statistical tests can aid selection.
    • Dimensionality Reduction: Algorithms like PCA that project the data onto a lower-dimensional space while preserving most of the variance.
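PCA can be written in a few lines via the SVD of the centred data matrix (a self-contained sketch; `pca_reduce` is my own helper name):

```python
import numpy as np

def pca_reduce(X, k):
    """Project centred data onto its top-k principal components (via SVD)."""
    X_centred = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]
    return X_centred @ components.T, components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z, components = pca_reduce(X, 3)
print(Z.shape)  # (200, 3)
```

Fitting downstream models on `Z` instead of `X` shrinks the effective parameter count, which is the variance-reduction mechanism described above.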
  • Regularization: L1/L2 penalties, weight decay, dropout, attention dropout
    • More about it in Supervised Machine Learning Course and Deep Learning Course
    • In linear regression, $\theta_0 + \theta_1 x + \theta_2 x^2$ may be a good (even perfect) model, but with $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ we need to shrink $\theta_3$ and $\theta_4$ (penalize them), i.e., regularize.
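That polynomial example can be run directly with closed-form ridge regression (a sketch under my own assumptions: quadratic ground truth, degree-4 features, an arbitrary $\lambda = 10$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x + 3.0 * x**2 + 0.1 * rng.normal(size=30)  # truth is quadratic

# Degree-4 polynomial design matrix: columns 1, x, x^2, x^3, x^4
F = np.vander(x, 5, increasing=True)

def ridge_fit(F, y, lam):
    """Closed-form ridge; the intercept (theta_0) is left unpenalized."""
    penalty = lam * np.eye(F.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(F.T @ F + penalty, F.T @ y)

theta_ols = ridge_fit(F, y, 0.0)
theta_reg = ridge_fit(F, y, 10.0)
print("no penalty:  ", np.round(theta_ols, 2))
print("lambda = 10: ", np.round(theta_reg, 2))
```

The penalized fit has a strictly smaller coefficient norm than the unpenalized one; in particular the superfluous $\theta_3$ and $\theta_4$ get pulled toward zero.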
  • Early Stopping: Stop training before the model memorizes noise
    • This was standard practice before double descent was discovered (see the epoch-wise double descent caveat above).
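The usual patience-based rule can be sketched as follows (my own minimal version; real frameworks add checkpointing and minimum-delta options):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping the scan once
    `patience` epochs pass without improvement."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.8, 0.9, 1.0]))  # 2
```

With epoch-wise double descent, a small `patience` can lock you into the first minimum, which is exactly the caveat raised earlier in these notes.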
  • Increase Training Data
    • Often the most effective way to combat overfitting.
    • More data provides a clearer picture of the underlying patterns and makes it harder for the model to fit random noise specific to a small sample.
  • Data Augmentation: If acquiring more real data is difficult, artificially expand the training set.
    • E.g., rotating/cropping images, adding slight noise, paraphrasing text
  • Reduce Model Complexity:
    • For neural networks, reduce the number of layers or neurons. For decision trees, prune the tree (limit depth or number of leaves). For polynomial regression, reduce the degree. Use smaller kernels in SVMs.
  • Ensembling (Averaging): Combine outputs from multiple trained models (e.g., random forests, deep ensembles), often significantly reducing variance
    • Bagging (Bootstrap Aggregating): Trains multiple instances of a base learner (often complex ones like decision trees) on different bootstrap samples (random samples with replacement) of the training data and averages their predictions. Random Forests are a prime example. Bagging primarily reduces variance.
    • Stacking (Stacked Generalization): Trains multiple different types of base models and uses another model (a meta-learner) to learn how to best combine their predictions. Aims to leverage the diverse strengths of different algorithms. It can help with both bias and variance, depending on the meta-learner.
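Bagging’s variance reduction can be seen on a toy problem (my own sketch, using 1-NN regression as a deliberately high-variance base learner): averaging predictions over bootstrap resamples visibly smooths the fitted curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_1nn(X_train, y_train, x_query):
    """1-NN regression: a low-bias, high-variance base learner."""
    dists = np.abs(x_query[:, None] - X_train[None, :])
    return y_train[np.argmin(dists, axis=1)]

def bagged_predict(X_train, y_train, x_query, n_models=50):
    """Bagging: average 1-NN predictions over bootstrap resamples."""
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        preds.append(predict_1nn(X_train[idx], y_train[idx], x_query))
    return np.mean(preds, axis=0)

X = rng.uniform(-1, 1, 40)
y = np.sin(3 * X) + 0.3 * rng.normal(size=40)
x_grid = np.linspace(-1, 1, 200)

single = predict_1nn(X, y, x_grid)
bagged = bagged_predict(X, y, x_grid)

# Total variation of the prediction curve: lower = smoother, less noise-chasing
tv = lambda p: np.sum(np.abs(np.diff(p)))
print("single TV:", tv(single), " bagged TV:", tv(bagged))
```

The averaged curve jumps around far less than any single memorizing fit, which is the variance-reduction mechanism behind Random Forests.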
⚖️
These techniques are often interconnected and their effects are not always isolated to either bias or variance.
  • Regularization, while primarily targeting high variance, inherently introduces some bias by constraining the model; finding the right regularization strength is key. Conversely, reducing regularization to combat high bias can increase variance.
  • Feature selection reduces model complexity and thus variance, but removing features that hold valuable information, even if subtle, can increase bias.

Other Losses - Decomposition of Bias and Variance

  • The neat algebraic decomposition of MSE into bias and variance does not carry over directly to other loss functions, but bias, variance, and their trade-off still absolutely apply.
  • Other loss functions are defined differently, so the mathematical steps used to decompose the expected error simply don’t yield the same clean, additive $\mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$ structure.
  • Because in ML we care about systematic error (high bias) versus sensitivity to the data (high variance) regardless of the decomposition, the trade-off still applies.
  • Regardless of the loss definition, we detect and diagnose bias and variance with the same indirect tools described above: learning curves, resubstitution vs. test error, and cross-validation.