
Bias and Variance

Last updated: 7/7/2025


Supervised machine learning endeavors to learn a mapping function, denoted as f, that accurately maps input data X to an output variable Y. The fundamental goal is not merely to perform well on the data used for training the model, but to generalize effectively, making accurate predictions on new, previously unseen data. The true measure of a model's success lies in its performance on this unseen data, often quantified by its prediction error.
💡
The total error associated with a machine learning model is NOT a single, indivisible quantity.

Prediction errors can be decomposed into two main subcomponents of interest: error from bias and error from variance, plus a third component, the Irreducible Error.

💡
The tradeoff between a model's ability to minimize bias and variance is foundational to training machine learning models, so it's worth taking the time to understand the concept.
💡
Bias and Variance: Two Concepts Easy to Learn — Difficult to Master.

Bias

Bias: It’s the error introduced by approximating a real-world problem (which may be extremely complex) by a much simpler model.

High bias: The model makes strong assumptions and fails to capture the data’s patterns well — this is called underfitting.

Example (High Bias): Using a linear model to fit a clearly non-linear relationship.

  • High training error + High validation error —> High Bias

    (Model is too simple, underfitting)
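A minimal numpy sketch of this signature (synthetic data; `true_f` and all constants are invented for the illustration): fitting a straight line to a sine-shaped relationship leaves both the training and the validation error high.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The true relationship is non-linear; a straight line cannot capture it.
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 50)
y_train = true_f(x_train) + rng.normal(0, 0.1, 50)
x_val = rng.uniform(0, 1, 50)
y_val = true_f(x_val) + rng.normal(0, 0.1, 50)

# Degree-1 (linear) fit: a strong assumption about the data -> underfitting.
coef = np.polyfit(x_train, y_train, deg=1)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
val_mse = np.mean((np.polyval(coef, x_val) - y_val) ** 2)

# Both errors are large and of similar size: the signature of high bias.
```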

Variance

Variance: It’s how much the model’s predictions change when trained on different data.

High variance: The model captures too much noise from the training data — this is called overfitting.

Example (High Variance): A deep decision tree that performs perfectly on training data but fails on test data.

  • Low training error + High validation error —> High Variance

    (model is too complex, overfitting)
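The same signature can be sketched in numpy. The note's example is a deep decision tree; as a stand-in (so the snippet needs only numpy), a high-degree polynomial shows the same behavior on synthetic data: near-zero training error, much larger test error.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(2 * np.pi * x)

n = 20
x_train = np.sort(rng.uniform(0, 1, n))
y_train = true_f(x_train) + rng.normal(0, 0.3, n)
x_test = rng.uniform(0, 1, 200)
y_test = true_f(x_test) + rng.normal(0, 0.3, 200)

# A very flexible model (12 coefficients for 20 points) chases the noise.
coef = np.polyfit(x_train, y_train, deg=11)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)

# Tiny training error with a much larger test error: high variance.
```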

Irreducible Error (ϵ)

  • In addition to the model's errors (Bias and Variance) —> there exists a third component known as irreducible error.
  • It’s an error that simply cannot be reduced by any model.
  • Analogy (Archer): Even the best archer in the world can NOT predict a sudden, unpredictable gust of wind that occurs just as the arrow leaves the bow. This unpredictable element is like an irreducible error.
  • In supervised learning, we often assume the real world is messy.

Sources of Irreducible Error inside Data

IMPORTANT: Irreducible error stems not from the model itself, but from inherent properties of the data.

  1. Measurement Errors: The tools and methods are often imperfect. Sensors might have limited precision, readings can fluctuate, or human error —> inaccuracies that deviate from the true underlying values.
  2. Inherent Randomness (Noise): Many real-world phenomena possess an intrinsic element of stochasticity or randomness.
    • For example, predicting the exact outcome of a coin flip or the precise movement of a stock price involves inherent unpredictability that no model can fully capture.
  3. Unobserved Features: There may be factors or features that influence the target variable but are not measured or included in the dataset used for modeling.
    • These hidden features can be the source of the noise in the relationship. We might have oversimplified the data model.

🤌🏼
It’s irreducible because it lies outside the scope of what the modeling process can control.

Improving the learning algorithm, tuning hyperparameters, or adding more data (of the same type) cannot eliminate noise originating from the fundamental nature of the problem or the data collection process itself.

🤌🏼
The irreducible error is the price of randomness — it’s nature’s way of saying: ‘No matter how smart you are, you can’t predict everything.’

It is irreducible because no model, no matter how smart, can predict randomness.

Formula

  • If we have a ground truth y and a true underlying function f —> then y = f(x) + \epsilon
  • We wish to create \hat{f}(x); even if it matches f(x) exactly, we still have \epsilon.
  • The output of \hat{f}(x) is \hat{y}.
  • \epsilon is independent of the new input x and of the training data D.
  • Mathematically: \mathbf{E}[\epsilon] = 0
  • Mathematically: Var(\epsilon) = \mathbf{E}[\epsilon^2] = \sigma^2
    • That’s because Var(\epsilon) = \mathbf{E}[(\epsilon - \mathbf{E}[\epsilon])^2] = \mathbf{E}[(\epsilon - 0)^2] = \mathbf{E}[\epsilon^2] = \sigma^2
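These two facts can be checked numerically (a quick numpy sketch; the choice of sigma = 0.5 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.5
eps = rng.normal(0.0, sigma, 1_000_000)  # zero-mean Gaussian noise

mean_eps = eps.mean()        # should be close to E[eps] = 0
var_eps = np.mean(eps ** 2)  # should be close to E[eps^2] = sigma^2 = 0.25
```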

🎯 Why Do We Assume the Noise ϵ Has Zero Mean (E[ϵ] = 0)?

  • By definition, we assume it’s pure randomness. —> If noise had a non-zero mean, it would mean there’s a systematic shift — a pattern we could model!
  • Another intuition: Noise = unpredictable deviations up or down, equally likely.
  • Imagine flipping a coin —> The noise is fair (half heads, half tails) —> Over many flips, the noise should cancel out → zero mean. Otherwise, it's not random noise — it's a systematic bias.
  • In standard ML and regression, zero-mean noise is the default because we want to cleanly separate randomness from predictability.

The Bias-Variance Tradeoff

  • The bias-variance tradeoff is one of the most fundamental concepts in supervised machine learning. It describes the inherent inverse relationship between the bias and the variance of a learning algorithm.
  • Generally, actions taken to decrease a model's bias tend to result in an increase in its variance. Conversely, efforts to reduce a model's variance often lead to an increase in its bias.
  • The goal is not to eliminate either bias or variance completely, but rather to find the "sweet spot”.

Target Shooting Analogy

  • Your goal is to shoot arrows at the center of a target (the bullseye = true function f(x))
| Concept | What it Means | Archer Analogy |
| --- | --- | --- |
| Bias | How far your average aim is from the true center. | If you are systematically aiming off-center (e.g., all your arrows land near the top-right of the target), you have high bias. |
| Variance | How much your shots spread out around your average aim. | If your arrows are scattered widely all over, you have high variance. If your arrows are tight together, you have low variance. |
| Irreducible Error | Random unavoidable noise in the system. | Even if you're perfect, some slight wind or hand shake could cause slight randomness. |
| Bias | Variance | What Happens | Description |
| --- | --- | --- | --- |
| Low | Low | Arrows tightly around bullseye | You're accurate and consistent: ideal model! |
| High | Low | Arrows tightly grouped but far from center | You're consistent but wrong, like a bad model that's underfitting |
| Low | High | Arrows centered on bullseye but spread out | You aim right but you're inconsistent, like a model overfitting different data |
| High | High | Arrows scattered and off-center | You're both wrong and inconsistent: worst case |

Curve Fitting Analogy

  • Even though we only discuss Bias-Variance from a regression perspective, keep in mind that the practical implications of the bias-variance tradeoff are applicable to all supervised learning contexts.

Test Analogy

  • A nervous student (high variance) panicking at every test.
  • A confident-but-wrong student (high bias) repeating the same mistakes.

Algorithm Tuning

Many algorithms have parameters that directly control complexity and thus navigate the tradeoff.

  • K-Nearest Neighbors (k-NN): A small k leads to low bias/high variance; increasing k increases bias/decreases variance.
  • Support Vector Machines (SVM): The regularization parameter C controls the tradeoff. High C leads to low bias/high variance; low C leads to high bias/low variance.

KNN Example

  • KNN is explained in a separate note.
  • Remember: bias pays very little attention to the training data and oversimplifies the model.
  • If K=1 and we have a test point —> the model looks at the single closest training point and copies its class, instead of looking at the closest K points and choosing among them.

    We call K=1 overfitting (low Bias, High Variance) because of the following 4 reasons:

    1. Memorizes noise: If a noisily-labeled point is close to a test point, it will copy that label exactly.
    2. Highly sensitive to tiny changes: Adding or removing one point can change the prediction.
    3. Complex decision boundaries: If you color the areas based on the classifications, you will find the boundary flips often between classes: jagged, unstable.
    4. It achieves 0 training error: When classifying a training point, the nearest neighbor is the point itself, so it copies its own (correct) label; with a larger K it would consider itself and others.
  • If K is large (e.g., K=68):
    • Even if a minority class dominates locally, the large majority from far away may override it.
    • The model is stable but can miss patterns (underfit) (High Bias, Low Variance); stable because it almost always predicts the same class, since the majority vote draws on a large share of the training set.
  • If K is intermediate (5 ≤ K ≤ 20):
    • Good balance between locality and smoothing, ✅ Moderate bias, ✅ moderate variance
    • Sweet spot of generalization
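The K sweep above can be reproduced with a tiny hand-rolled k-NN regressor (a numpy sketch on synthetic 1-D data; `knn_predict` and all constants are invented for this example, and regression stands in for classification):

```python
import numpy as np

rng = np.random.default_rng(7)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 100)
y_train = true_f(x_train) + rng.normal(0, 0.3, 100)
x_test = rng.uniform(0, 1, 500)
y_test = true_f(x_test) + rng.normal(0, 0.3, 500)

def knn_predict(x_query, k):
    # Average the labels of the k nearest training points (1-D regression).
    dists = np.abs(x_train[:, None] - x_query[None, :])
    nearest = np.argsort(dists, axis=0)[:k]
    return y_train[nearest].mean(axis=0)

train_err_k1 = np.mean((knn_predict(x_train, 1) - y_train) ** 2)  # exactly 0: each point picks itself
test_err_k1 = np.mean((knn_predict(x_test, 1) - y_test) ** 2)     # K=1: high variance
test_err_k10 = np.mean((knn_predict(x_test, 10) - y_test) ** 2)   # intermediate K: sweet spot
test_err_k90 = np.mean((knn_predict(x_test, 90) - y_test) ** 2)   # large K: high bias (oversmoothed)
```

With this setup the intermediate K beats both extremes on test error, while K=1 still scores a perfect 0 on training error.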

Bias Detailed

In Statistics

Bias(\hat{\theta}) = \mathbf{E}[\hat{\theta}] - \theta
  • Remember in statistics: A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number describing a sample (e.g., sample mean).
    • Statistics (sample mean, sample variance, etc.) are estimators of the parameters.
    • We want to ensure they accurately represent the population, so we find and derive Unbiased Estimators.
    • A statistic \hat{\theta} is an Unbiased Estimator for a parameter \theta if \mathbf{E}[\hat{\theta}] = \theta —> which means the expected value of the estimator is exactly the population parameter.
    • In a nutshell: if we had many sample datasets (each of size n), the average (expected value) of their statistic would equal the population parameter.
    • More on deriving Unbiased Estimators is covered in a separate note.
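The classic concrete case is sample variance: dividing by n gives a biased estimator, dividing by n-1 an unbiased one. A Monte Carlo check (numpy sketch; the population sigma = 2 and n = 10 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10  # sample size; true population variance is sigma^2 = 4

# Draw many independent samples of size n and compute both estimators on each.
samples = rng.normal(0.0, 2.0, size=(200_000, n))
biased = samples.var(axis=1, ddof=0)    # divide by n
unbiased = samples.var(axis=1, ddof=1)  # divide by n - 1

mean_biased = biased.mean()      # ~ (n-1)/n * sigma^2 = 3.6  (systematically low)
mean_unbiased = unbiased.mean()  # ~ sigma^2 = 4.0            (E[estimator] = parameter)
```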

In ML (Supervised Models) (Theoretical Definition)

Bias(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)] - f(x)

We still have the same Bias definition: it’s the difference between the average prediction of a model and the true value.

  1. Same as we studied in irreducible error: our goal is to learn the true function f(x) by creating our own model \hat{f}(x), because f(x) is usually unknown.
    • Remember that the true value is y = f(x) + \epsilon.
  2. We use a specific training dataset D to train a model, which results in a learned function \hat{f}(x; D), often just written as \hat{f}(x).
    • Our learned function depends entirely on the specific random training dataset D.
    • With another D, we'd get a slightly different \hat{f}(x).
  3. Now, let's focus on a single new input point x. We evaluate it using our model \hat{f}(x), which was trained on a specific dataset D.
    • IMPORTANT, CRUCIAL: Since the model depends on the random dataset D, the prediction \hat{f}(x) at this specific point x is also a random variable (because we could retrain the model on many different possible datasets D).
  4. Imagine we computed \hat{f}(x) after training different versions of the model on all possible training datasets D of the same size.
    • We call the average of these predictions \mathbb{E}_D[\hat{f}(x)], or simply \mathbb{E}[\hat{f}(x)].
  5. Finally, the Point-wise Bias: the bias of a machine learning model \hat{f} at the specific point x is Bias(x) = \mathbb{E}[\hat{f}(x)] - f(x).
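The "retrain on many datasets" thought experiment can be simulated directly (a numpy sketch with a made-up true function x^2 and a deliberately too-simple linear model; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def true_f(x):
    return x ** 2  # the (normally unknown) true function

x0 = 0.9   # the fixed query point x
preds = []
for _ in range(2000):
    # Each iteration plays the role of one possible training dataset D.
    x = rng.uniform(-1, 1, 30)
    y = true_f(x) + rng.normal(0, 0.1, 30)
    coef = np.polyfit(x, y, deg=1)      # too-simple (linear) model
    preds.append(np.polyval(coef, x0))  # f_hat(x0; D)

preds = np.array(preds)
bias_at_x0 = preds.mean() - true_f(x0)  # ~ E[f_hat(x0)] - f(x0)
```

Because a line fit to a symmetric parabola averages out to roughly the constant 1/3, the bias at x0 = 0.9 is strongly negative (about 1/3 - 0.81): a systematic miss that no amount of retraining removes.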
ℹ️
  • Bias here measures: On average, how wrong is your model at a point x?
  • Bias represents a systematic error inherent in the model. The model's tendency to consistently miss the true value, regardless of the particular training data used.
  • This systematic deviation arises from the model's inherent limitations or the assumptions it makes about the data.
🤌🏻
Where this "many training datasets" idea comes from?

When we talk about bias and variance formally in machine learning, we imagine (theoretically) that: There is a huge population of possible training datasets you could sample from the real world.

IMPORTANT: Theoretical bias and variance exist for mathematical understanding, but practical ML uses validation sets, cross-validation, regularization, etc., to control bias/variance WITHOUT directly computing them.

📖 Why can't we directly compute the expectation? Why the Bias Calculation is only theoretical in ML?

It is important to distinguish between the theoretical definition of bias and variance and how it is diagnosed in practice. The formal definition involves calculating the expected value over an infinite number of hypothetical training sets drawn from the true data distribution. This theoretical quantity measures the inherent instability of the learning algorithm given the data distribution and sample size.

  • We'd need to retrain the model on all possible datasets — an infinite process.
  • We'd need access to the true function f(x), which is impossible in real life.
  • We'd need to sample infinite noise patterns ϵ, again impossible.
🔥
In practical ML workflows, practitioners typically work with only one or a limited number of training datasets. It is impossible to directly compute the expectation over all possible datasets.

Solution: variance and bias are diagnosed indirectly by observing the model's performance.

We discuss this in a separate note.

⚠️ Important Note: Unbiased ≠ Best

  • An estimator (statistic) \hat{\theta} is unbiased for a parameter \theta if: \mathbf{E}[\hat{\theta}] = \theta
  • An unbiased estimator might still be bad if its variance is huge; that is, the estimates are all over the place.
  • Key to Understand: like a dart thrower whose arrows are wildly scattered (High Variance) but on average centered (Unbiased).
    • He might miss the bullseye each time but the average landing spot is exactly at the center.
  • We will use MSE to evaluate both bias and variance.
Rule of Thumb:

❗ Unbiased does not mean optimal. ✅ Optimal = lowest MSE, even if biased.


Variance Detailed

Var(\hat{\theta}) = \mathbf{E}[(\hat{\theta} - \mathbf{E}[\hat{\theta}])^2]
Var(\hat{f}(x)) = \mathbf{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]
  • This is just the formal writing of variance (squared standard deviation) —> \sigma^2 = \sum \frac{(x_i - \bar{X})^2}{n}
  • Formally: The variance of a model's prediction f^(x)\hat{f}(x) at point xx is the expected squared difference between the prediction made by a model trained on a specific dataset D and the average prediction over all possible datasets D.
    • Same multiple training dataset definitions as we did in Bias
  • Conceptually: High variance is synonymous with high model sensitivity to the training data. It is sensitivity to fluctuations, noise within the training data ; small changes in the training set can lead to significantly different learned models and predictions.
  • An overfit model fails to generalize because it has learned patterns that do not exist in the broader data distribution, mistaking noise for signal.
  • A significant gap between low training error and high validation error is interpreted as a strong indicator of high variance.

Mean Squared Error (MSE)

  • MSE is a criterion that tries to take into account both concerns (Bias) and (Variance).
  • MSE is a way to measure the goodness of the estimator.
  • MSE is the average of squared errors.

Scenario: Choosing Between Two Estimators 🎭

  • Imagine we have two models, which one would you prefer? Isn't "unbiased" always better?
    | Estimator | Bias | Variance |
    | --- | --- | --- |
    | Estimator A | ✅ Small bias | ✅ Small variance |
    | Estimator B | ✅ Zero bias (Unbiased) | ❌ Very high variance |
  • Estimator A: Slightly biased but very stable.
  • Estimator B: Unbiased but very unstable.
  • Actually we would prefer Estimator A! Because small bias + low variance → often better than zero bias + huge variance.
  • To make it easier, we actually need to minimize MSE!
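This preference can be demonstrated with a toy pair of estimators of a population mean (a numpy Monte Carlo sketch; the shrinkage factor 0.8 and constants mu = 10, sigma = 10, n = 5 are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma, n = 10.0, 10.0, 5

samples = rng.normal(mu, sigma, size=(400_000, n))
xbar = samples.mean(axis=1)

est_B = xbar        # Estimator B: unbiased, variance sigma^2 / n = 20
est_A = 0.8 * xbar  # Estimator A: shrunk toward 0, biased but lower variance

mse_B = np.mean((est_B - mu) ** 2)  # ~ 20
mse_A = np.mean((est_A - mu) ** 2)  # ~ bias^2 + var = (0.2*10)^2 + 0.64*20 = 16.8
```

The slightly biased Estimator A wins on MSE, exactly the point of the scenario.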

Types of MSE

IMPORTANT: MSE is the Expected Squared Error — but Expected over what exactly?

In truth, there are several distinct flavors of theoretical MSE. Yet, in much of the machine learning literature, authors often refer to "MSE" without clearly specifying which version they mean.

  1. First Type of MSE: Fixed Single Test Input x (Training Randomness Only) → We evaluate the performance of a model at a specific test point x, by imagining many possible training datasets.
    • Here, the randomness comes only from the training dataset.
    • We ask: "If we trained on different datasets, how would the model behave at this exact x?"
  2. Second Type of MSE: Fixed Single Trained Model (Input Randomness Only) We evaluate the performance of a single trained model, by measuring its error across different input points drawn from the data distribution.
    🔥
    This version is often used in practical machine learning, where we assume the model is fixed and test performance varies with the input data.
  3. Third Type of MSE: Full Expectation (Training and Input Randomness) → We evaluate a model’s performance over both sources of randomness
    • This involves a double expectation: first over training datasets, then over input points.

The good news is that, regardless of which MSE flavor you are faced with, they all admit a decomposition into bias-squared, variance, and noise.

Derivation (Using Training Dataset Randomness)

  1. MSE = \mathbf{E}[(y - \hat{f}(x))^2]
  2. We have y = f(x) + \epsilon, so MSE = \mathbf{E}[(f(x) + \epsilon - \hat{f}(x))^2]
  3. Rearrange the terms inside the parentheses: \mathbf{E}[((f(x) - \hat{f}(x)) + \epsilon)^2]
  4. Expand the square (a+b)^2 = a^2 + 2ab + b^2: \mathbf{E}[(f(x) - \hat{f}(x))^2 + 2\epsilon(f(x) - \hat{f}(x)) + \epsilon^2]
  5. Use the linearity of expectation: \mathbf{E}[(f(x) - \hat{f}(x))^2] + \mathbf{E}[2\epsilon(f(x) - \hat{f}(x))] + \mathbf{E}[\epsilon^2]
    1. Since \mathbf{E}[\epsilon] = 0, we have \mathbf{E}[\epsilon^2] = Var(\epsilon) = \sigma^2
    2. For the middle term: \mathbf{E}[2\epsilon(f(x) - \hat{f}(x))] = 2\mathbf{E}[\epsilon]\mathbf{E}[f(x) - \hat{f}(x)], because the noise of the point (x, y) is independent of the training data D and of the function f(x). Since \mathbf{E}[\epsilon] = 0, the whole term is ZERO.
  6. We are still left with the piece \mathbf{E}[(f(x) - \hat{f}(x))^2]
    1. We now use (a-b)^2 = a^2 - 2ab + b^2 —> \mathbf{E}[f(x)^2 - 2 f(x)\hat{f}(x) + \hat{f}(x)^2]
    2. Use the linearity of expectation: \mathbf{E}[f(x)^2] - 2\mathbf{E}[f(x)\hat{f}(x)] + \mathbf{E}[\hat{f}(x)^2]
    3. We know that f(x) is a fixed value (the true function at x), so —> f(x)^2 - 2 f(x)\mathbf{E}[\hat{f}(x)] + \mathbf{E}[\hat{f}(x)^2]
    4. Add and subtract \mathbf{E}[\hat{f}(x)]^2 —> f(x)^2 - 2 f(x)\mathbf{E}[\hat{f}(x)] + \mathbf{E}[\hat{f}(x)]^2 - \mathbf{E}[\hat{f}(x)]^2 + \mathbf{E}[\hat{f}(x)^2]
    5. Now the trick is that we can form Bias^2 and Variance from step (4):
      1. f(x)^2 - 2 f(x)\mathbf{E}[\hat{f}(x)] + \mathbf{E}[\hat{f}(x)]^2 is just (f(x) - \mathbf{E}[\hat{f}(x)])^2, which is Bias(\hat{f}(x))^2
      2. We are left with \mathbf{E}[\hat{f}(x)^2] - \mathbf{E}[\hat{f}(x)]^2, which is the definition of Var(\hat{f}(x))
  7. All in all, combining steps 5 and 6: MSE = Var(\hat{f}(x)) + Bias(\hat{f}(x))^2 + \sigma^2
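The decomposition can be verified numerically by simulating the "many training datasets" expectation (a numpy sketch; the true function, the degree-3 model, and all constants are invented for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3

def true_f(x):
    return np.sin(2 * np.pi * x)

x0 = 0.3  # fixed query point
preds = []
for _ in range(5000):
    # One possible training dataset D per iteration.
    x = rng.uniform(0, 1, 40)
    y = true_f(x) + rng.normal(0, sigma, 40)
    coef = np.polyfit(x, y, deg=3)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()

# Fresh noisy targets y = f(x0) + eps, one per trained model.
y0 = true_f(x0) + rng.normal(0, sigma, 5000)
mse = np.mean((y0 - preds) ** 2)

decomposed = bias_sq + variance + sigma ** 2
# mse and decomposed agree up to Monte Carlo error.
```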

Why the irreducible error is the lower bound on the expected prediction error?

  • Non-Negative Components (MSE):
    • Bias^2 is always ≥ 0 because it's a squared value.
    • Variance Var(\hat{f}(x)) is always ≥ 0 by definition.
    • The irreducible error has variance \sigma^2 = Var(\epsilon), which is also always ≥ 0. In most real-world problems there is some noise, so \sigma^2 > 0.
  • The Lower Limit: Even if Bias and Variance were zero, the expected minimum error (MSE) would satisfy: \mathbf{E}[(y - \hat{f}(x))^2] ≥ 0 + 0 + \sigma^2

No model, no matter how sophisticated or perfectly trained, can achieve an expected squared error lower than \sigma^2 (at point x, averaged over datasets and noise).

Modern Machine Learning Twist 🧠 (Double Descent Phenomenon) (The end of the bias-variance trade-off?)

🔥
✅ In modern ML, especially deep learning, things are a little surprising:
  • Very large models (millions or billions of parameters) can still generalize well if trained carefully.
  • This goes against the "classical" view where too much complexity must cause overfitting.
  • The term Double Descent was coined in Dec 2018 in Reconciling Modern ML Practice and the Bias–Variance Trade-off, then in Dec 2019 OpenAI published a paper and article Deep Double Descent that shows that effect in CNNs, ResNets, transformers and epoch-wise training.
    • Since then Dozens of papers (e.g. “On the Role of Optimization in Double Descent”, “Double Descent Demystified”) extend the phenomenon to linear regression, random forests, kernel methods and analyze implicit regularization of SGD.
  • The Second Descent reveals behavior of (extreme overparameterization) that is not typically analyzed classically. We observe strong test performance from very overfit, complex models.
  • It’s counterintuitive because you would expect DL training to be monotonic.

[1] The Classical Regime (Under-parameterized: P < N) (U-Curve): Underfitting — Sweet Spot — Overfitting

  • Classical Tradeoff: increasing model complexity (e.g., adding parameters) decreases bias (the model becomes more flexible and can approximate the true function better) but increases variance (the model becomes more sensitive to the specific training points, including noise).
  • Classical Tradeoff leads to the classical U-shaped curve for test error vs model complexity.
  • The Blue Line (Test Error):
    1. Underfitting Region: Initially, increasing complexity improves test performance (lower bias).
    2. Sweet Spot (Optimal Balance): There's a balance where test error is minimized.
    3. Overfitting Region: Keep increasing complexity and test error starts to climb. (This is more cumbersome than underfitting)

  • The Red Line (Train Error):
    • It keeps decreasing until reaching the interpolation threshold.
    • Interpolation Threshold: the point where training error first becomes exactly zero. (High Variance)
    • Interpolation Threshold: means the model is so wiggly that it has perfectly fit and memorized the entire training set, passing through each datapoint.
    • Interpolation Threshold: mainly concerns the training error being ZERO; at this point the Test Error can be at its peak.
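The classical U-curve portion can be sketched by sweeping polynomial degree on synthetic data (numpy only; the degrees 0 / 3 / 17 are arbitrary picks for "too simple", "balanced", and "near interpolation"):

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_f(x_train) + rng.normal(0, 0.3, 20)
x_test = rng.uniform(0, 1, 300)
y_test = true_f(x_test) + rng.normal(0, 0.3, 300)

train_err, test_err = {}, {}
for deg in [0, 3, 17]:  # underfit / sweet spot / near the interpolation threshold
    coef = np.polyfit(x_train, y_train, deg)
    train_err[deg] = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_err[deg] = np.mean((np.polyval(coef, x_test) - y_test) ** 2)

# Train error falls monotonically with complexity; test error is U-shaped.
```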
https://medium.com/@fernando.dijkinga/double-descent-the-surprising-phenomenon-challenging-deep-learning-theory-679a8440d37f

[2] What Happens at the Interpolation Threshold? (P ≈ N)

  • Interpolation here literally means that our model has interpolated between every single training point; it's drawing a curve that runs through every single training point.
  • P = Number of Parameters, N = Number of Training Samples
At the interpolation threshold, there can often be a single choice of model that works, and there is no reason to believe that model will be good for production.
  • At this threshold we find P = N. But why does that make the training error zero? And why is there only one way to interpolate the data?
  • Let’s use polynomial regression as an example:
    • You have 5 training datapoints (x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4), (x_5, y_5).
    • You have only 1 input column / feature.
    • To have P = N, we need 5 coefficients in the model —> which means a polynomial function of degree 4.
    • Our model is y = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4.
    • So it's like we have 5 equations (one per sample) with 5 unknown coefficients (a_0, a_1, a_2, a_3, a_4).
    • So (in general, unless the matrix is badly degenerate) there is one unique solution.
      • Degenerate matrices and matrix inverses are discussed in a separate note.
  • This is why at the interpolation threshold we have only ONE SINGLE SOLUTION. That unique solution can be weird, wiggly, and overfit.
    • In the next phase (the Over-parameterized Regime), we will have multiple solutions to choose from.
  • What if we have two features? How can we do polynomial regression up to P = N?
    • It's simple —> f(x_1, x_2) = y = a_0 + a_1x_1 + a_2x_2 + a_3x_1^2 + a_4x_1x_2 + a_5x_2^2 + \cdots
    • We expand to many polynomial terms (combined features) and fit the coefficients over those expanded features.
    • Terms are powers and cross-products of x_1 and x_2.
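The P = N case from the example above can be checked directly: 5 points, 5 coefficients, one interpolant with zero training residuals (a numpy sketch; the sample points approximate sin(2πx) but any 5 points with distinct x would do):

```python
import numpy as np

# 5 training points and 5 polynomial coefficients (degree 4): P = N.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([0.0, 1.0, 0.0, -1.0, 0.0])  # samples of sin(2*pi*x)

# The 5x5 Vandermonde system has exactly one solution: the unique interpolant.
coef = np.polyfit(x, y, deg=4)
train_residuals = np.polyval(coef, x) - y  # zero at every training point
```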

[3] The Modern Interpolating Regime (Over-parameterized: P ≫ N)

  • After the interpolation threshold, as the model complexity P increases —> the curve does not continue to rise indefinitely; instead a second descent happens in the Test Error —> we get the Double Descent Curve!
  • It’s the characteristic of many successful modern deep learning models.
  • The Double Descent Phenomenon is not confined to a specific niche model or dataset.
    • It has been proved to apply in linear regression, ridge regression, Neural Networks, Convolutional Neural Networks (CNNs), Residual Networks (ResNets), Transformers, Decision Trees and Ensemble Methods.
    • It has been tested on various datasets, including standard benchmarks like CIFAR-10 and CIFAR-100 (often with added label noise to amplify the effect) , ImageNet , MNIST , as well as synthetic datasets (e.g., Gaussian data for theoretical analysis) and other real-world regression and classification datasets

The question is why does it work when P ≫ N?

  • Empirically this is well proven to happen, but theoretically why it happens is still an active area of research 😮
  • Existing explanations often rely on simplified models (like linear regression or random features).

A quick simple analogy (IMPORTANT): You have 10 points to fit and exactly 10 sticks (parameters): there is essentially one way to thread the sticks through every point. With far more sticks than points, there are many ways to fit all the points, and some of those arrangements are much smoother.

Implicit Regularization and Optimization Bias

  • This is the most central theme in explaining the second descent.
  1. After the interpolation threshold, every model onwards passes through each training data point.
  2. Now there are infinitely many models that can fit the training data perfectly.
  3. Some of these solutions are ugly, but some are beautiful:
    • The only thing that changes is how the model connects the in-between points.
    • As the models become more and more complex, these connections can become smoother, and the resulting prediction may fit your test data better.
🔥
Optimizers like SGD "choose" the nicer solutions (smooth, low-norm) naturally from the infinite solutions. But HOW? That is exactly what the research on implicit regularization studies.
  • In general, SGD leans towards solutions with minimum L2 norm (simpler weights) and zero residual, flatter minima (a less sharp loss surface), and smoother functions.
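The "many interpolating solutions, pick the low-norm one" idea can be made concrete with plain linear algebra (a numpy sketch on a random under-determined system; note this uses `np.linalg.lstsq`'s minimum-norm solution as a stand-in for what SGD does implicitly in deep networks):

```python
import numpy as np

rng = np.random.default_rng(9)

# Under-determined linear system: N = 5 samples, P = 20 parameters (P >> N).
N, P = 5, 20
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

# Infinitely many w satisfy X @ w = y; lstsq returns the minimum-L2-norm one.
w_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Build another interpolating solution by adding a null-space direction:
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                  # X @ null_vec ~ 0
w_other = w_min + 0.5 * null_vec   # still fits the training data exactly

# w_other interpolates just as well but has a strictly larger norm.
```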
🔥
Two Core Principles:

1. At the interpolation threshold, there can often be a single choice of model that works.

2. In the limit of infinitely large models, there will be a vast number of interpolating models, and we can pick the best amongst them.

Factors Influencing Double Descent

The precise shape, peak location, and height of the double descent curve are not fixed; they are influenced by the following factors: the model, the data, the training process, regularization, sparsity, and noise.

[1] Model Architecture and Size (PP) Model-wise Double Descent

  • This is the most frequently studied form —> varying the size of the model (often the network width) while keeping the dataset and training procedure fixed.
  • Increasing PP is a necessity to move us to the Over-parameterized region.
  • Increasing network width (Number of Neurons in Layer) often leads to the canonical double descent behavior.
  • In contrast, increasing network depth (Number of Layers) beyond a certain point, while also increasing PP, has been observed in some studies (e.g., with ResNets) to worsen test performance monotonically.
  • Note: Architectural Choices induce different implicit biases or optimization landscapes, not simply increasing PP.

[2] Dataset Size (NN) Sample-wise Non-monotonicity

  • Describes a regime where more samples hurt the performance of the model.
  • Increasing N shifts the double descent peak towards larger model sizes P.
  • Interesting: If PP is fixed, increasing N can push the model away from the second descent towards the interpolation threshold —> This highlights the need to co-evolve model size and dataset size.

[3] Training Duration (Epochs) Epoch-wise Double Descent

  • There is a regime where training longer reverses overfitting.
  • Continue training even if overfitting (peak error), and potentially a second descent.
  • Don’t do Early Stopping on Test / Validation dataset — It can prevent the double descent curve, making you stop near the first minimum.
  • The existence of epoch-wise double descent complicates the standard practice of early stopping based solely on a validation-error increase.
    https://arxiv.org/pdf/1912.02292

[4] Regularization: Regularization-wise Double Descent

  • Regularization = Any method that encourages the model to avoid overfitting and generalize better.
  • Explicit Regularization: You add it manually (in loss or architecture)
    • L2 weight decay (Ridge), Dropout, Early stopping, Data augmentation
  • Implicit Regularization / Implicit Bias: It happens naturally from training dynamics
    • The optimizer Choice (e.g., SGD, Adam, RMSprop), Initial weights, Batch Normalization, Residual Connections
  • Explicit Regularization is a factor because it can flatten or remove the harmful test-error peak at the interpolation threshold, making the transition from under- to over-parameterized smoother.
  • Implicit Regularization is a factor because this implicit bias guides the selection among the many interpolating solutions in the P > N regime, favoring smooth curves and low-norm weights.

[5] Sparsity:

  • Sparsity means: many parameters (weights) are zero —> You "prune" (remove) neurons, weights, or filters from a network after training (or even during training).
  • Intuition says: fewer parameters = lower capacity → less overfitting, more generalization.
  • In Double Descent discussions, sparsity is less commonly mentioned, but surprisingly it follows a different pattern: Worsen —> Improve —> Worsen
  • When you gradually increase sparsity in a trained model:
| Stage | Behavior | Why |
| --- | --- | --- |
| Low sparsity (small pruning) | Test performance worsens first | Pruning deletes useful redundancies and disrupts the network’s fine-tuned balance; even small amounts can break important structures. |
| Moderate sparsity | Test performance improves | Sparsity now acts like strong regularization, suppressing overfitting and letting true patterns dominate. |
| Extreme sparsity | Test performance worsens again | The model is underfitting; it can no longer represent the true signal. |
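The pruning step itself is simple; a common variant is magnitude pruning, sketched below (the function name and the toy weight vector are my own, for illustration): zero out the smallest-magnitude fraction of weights.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    pruned = weights.copy()
    k = int(sparsity * pruned.size)
    if k > 0:
        drop = np.argsort(np.abs(pruned.ravel()))[:k]
        pruned.ravel()[drop] = 0.0
    return pruned

w = np.array([0.05, -2.0, 0.7, -0.01, 1.5, 0.3])
for s in (0.0, 0.33, 0.66):
    print(f"sparsity={s}: {magnitude_prune(w, s)}")
```

Sweeping `sparsity` from 0 toward 1 on a trained model and measuring test error is how the worsen → improve → worsen curve above would be traced.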

[6] Data Noise Levels

  • Label noise = some training examples have the wrong labels. (For example, an image of a "cat" is mistakenly labeled "dog.")
  • The model doesn’t know which labels are wrong — it tries to fit everything.
  • With noisy labels, to achieve perfect fit (zero training error), the model is forced to contort itself — bend unnaturally — to also fit the wrong labels.
  • Even without artificial noise, real-world complex datasets (like CIFAR-10, ImageNet) naturally have label noise or label ambiguity. Thus, even clean-looking datasets can exhibit double descent — but adding noise makes the effect much more dramatic and easier to observe.
  • Label noise in the training data amplifies the peak of the double descent curve: near the interpolation threshold the model has memorized wrong labels, so it predicts worse on test data.
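Injecting controlled label noise, as the papers on double descent do in their experiments, can be sketched like this (a small helper of my own; the function name and fractions are illustrative):

```python
import numpy as np

def add_label_noise(labels, noise_frac, n_classes, rng):
    """Flip a random fraction of labels to a uniformly chosen *wrong* class."""
    noisy = labels.copy()
    n_flip = int(noise_frac * len(labels))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        wrong = [c for c in range(n_classes) if c != labels[i]]
        noisy[i] = rng.choice(wrong)
    return noisy

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)          # clean labels, 10 classes
y_noisy = add_label_noise(y, 0.2, 10, rng)  # 20% label noise
print("fraction flipped:", np.mean(y_noisy != y))
```

Because each flip goes to a wrong class, exactly `noise_frac` of the labels end up incorrect, which makes the peak amplification easy to study at different noise levels.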
🔥
Finally, if there is one thing both classical statisticians and deep learning practitioners agree on, it is that “more data is always better”.

Factors Influencing Double Descent (DD) Characteristics

| Factor | Typical Effect on DD Curve | Notes |
| --- | --- | --- |
| Model Size (Width ↑) | Increasing model complexity, especially width (number of nodes per layer), moves along the x-axis, so we observe the full DD curve. | |
| Model Size (Depth ↑) | Same as width, but may degrade performance beyond a point, so the second descent is not guaranteed. | Behavior is architecture-dependent (e.g., ResNets often handle depth better). |
| Dataset Size (N ↑) | Shifts the peak to higher P, because more parameters are needed to reach interpolation. | Changes the location of the interpolation threshold (P ≈ N) and can temporarily increase error if P is fixed near the critical regime. |
| Training Epochs ↑ | Traverses the epoch-wise DD curve; reverses overfitting in large models. | |
| Explicit Regularization ↑ (e.g., L2, Dropout) | Reduces or eliminates the peak height; smooths the curve. | These methods exist precisely to prevent overfitting, so ideally there is no interpolation peak at all, only a monotonic smooth curve. |
| Implicit Optimization (Optimizer Choice) | Alters curve shape via different implicit biases. | Different optimizers (e.g., SGD vs. Adam) favor different solution regions; impacts generalization and DD visibility. |
| Sparsity ↑ | Produces a non-monotonic “sparse DD”: error ↑ then ↓ then ↑ again. | Initially hurts (damages structure), then helps (regularization), finally harms again (underfitting). |
| Label Noise ↑ | Amplifies the interpolation peak; worsens test error near the threshold. | Forces models near P ≈ N to fit incorrect labels, yielding unstable, poor solutions; makes DD more pronounced. |

🔥 How we indirectly detect (diagnose) Bias and Variance in practice

  1. Training vs Validation Learning Curves: Plotting the model's performance (e.g., loss or accuracy) vs. training time (epochs) or dataset size.
    • High Bias:
      • Both curves plateau at a high error level. They are close together, but the error is unacceptably high; increasing training time or data doesn’t help if the model is too simple.
    • High Variance:
      • There's a significant gap between the two curves.
      • One way to decrease variance is to add more training data. However, if the overall dataset is fixed, there is a price to pay: less data remains for validation and testing.
    • Good Fit: Both converge at a low error level with only a small gap between them.
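The three cases above can be captured in a crude heuristic (entirely my own; the thresholds are illustrative and problem-dependent, not a standard recipe):

```python
def diagnose(train_error, val_error, acceptable=0.10, gap_tol=0.05):
    """Crude reading of a learning curve's final train/validation errors.
    `acceptable` and `gap_tol` are illustrative thresholds, not standards."""
    if train_error > acceptable:
        return "high bias (underfitting): both errors high and close"
    if val_error - train_error > gap_tol:
        return "high variance (overfitting): large train/val gap"
    return "good fit: low errors, small gap"

print(diagnose(0.30, 0.32))  # both high -> high bias
print(diagnose(0.02, 0.20))  # big gap  -> high variance
print(diagnose(0.03, 0.05))  # low + close -> good fit
```

In practice you would read the full curves rather than two numbers, but the decision logic is the same.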
https://analystprep.com/study-notes/cfa-level-2/quantitative-method/overfitting-methods-addressing/
Resubstitution Error
  • Error obtained from Training Data.
  • You train and test on the same data to get the best-case scenario; it’s an optimistic measure.
  • If Low Resubstitution Error + High Test Error —> Overfitting (High Variance)
  • If High Resubstitution Error + High Test Error —> Underfitting (High Bias)
  • Resubstitution error mainly reflects bias (a high value signals underfitting) and is an optimistic lower bound on the true error —> It tells you little about how predictions will fluctuate on new data; variance is invisible here.
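A small numpy demonstration of how optimistic resubstitution error can be (my own toy setup): a 1-nearest-neighbour classifier on purely random labels. It memorizes the training set perfectly, so resubstitution error is zero, while test error sits near chance.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_predict(X_train, y_train, X):
    """1-nearest-neighbour classifier: a pure memorizer of the training set."""
    dists = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[np.argmin(dists, axis=1)]

X_train = rng.normal(size=(100, 2))
y_train = rng.integers(0, 2, size=100)  # random labels: nothing to learn
X_test = rng.normal(size=(100, 2))
y_test = rng.integers(0, 2, size=100)

resub_error = np.mean(one_nn_predict(X_train, y_train, X_train) != y_train)
test_error = np.mean(one_nn_predict(X_train, y_train, X_test) != y_test)
print("resubstitution error:", resub_error)  # 0.0 — wildly optimistic
print("test error:", test_error)             # near chance
```

Low resubstitution error plus high test error is exactly the overfitting (high variance) signature from the bullet above.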
  2. Model Complexity:
    • If you use a very simple model (linear regression on a complex problem, a shallow neural network, a low-degree polynomial) and see poor performance everywhere, high bias is the likely suspect.
    • If you use a very complex model (a very deep network, a high-degree polynomial, a decision tree with no depth limit) without sufficient regularization, and see a large gap between training and validation performance, suspect high variance.
  3. Cross-Validation Results:
    • If performance is consistently poor across all folds of cross-validation, it points towards high bias.
    • If performance varies drastically between different folds (very good on some, poor on others), it suggests sensitivity to training data subset, indicating high variance.
    • ✅ Always evaluate across multiple seeds or folds.
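The fold-consistency check can be summarized in two numbers (a sketch of my own; `cv_spread` is a hypothetical helper): the mean across folds hints at bias, the spread hints at variance.

```python
import numpy as np

def cv_spread(fold_scores):
    """Mean hints at bias (low mean = consistently poor fit);
    std hints at variance (large spread = sensitivity to the training subset)."""
    scores = np.asarray(fold_scores, dtype=float)
    return {"mean": float(scores.mean()), "std": float(scores.std())}

print(cv_spread([0.61, 0.60, 0.62, 0.59]))  # low mean, tiny std -> high bias
print(cv_spread([0.95, 0.55, 0.90, 0.58]))  # large std -> high variance
```

What counts as a "large" std is problem-dependent; the comparison between models on the same folds is what matters.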

🔥 Managing Bias and Variance: Techniques and Strategies

Techniques to Manage Bias

  • Use More Complex Models: Switch to a More Powerful Model or Increase Complexity within the Model
  • Decrease Regularization:
    • Regularization methods (like L1, L2, dropout) are primarily used to combat overfitting by penalizing model complexity. If a model is underfitting, it might be because the regularization is too strong.
    • It gives the model more freedom to fit the data. Removing regularization entirely might also be considered.
  • Train Longer or Increase Training Data (Use with Caution)
    • For iterative algorithms like neural networks trained with gradient descent, underfitting might occur if the training process is stopped too early before the model has converged.
    • Learning curves are essential to look at.
  • Feature Engineering / Embeddings: useful when the input features do not contain enough information to predict the target accurately
    • Replace Static word2vec with Context-Aware BERT/RoBERTa/LLM embeddings
    • Train embeddings jointly with task: Don’t freeze pre-trained embeddings; let them adapt to your task
    • Collect or derive additional relevant features that capture more aspects of the problem. Domain expertise is often crucial here.
    • Create new features from existing ones, such as interaction terms (products of features)
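The interaction-terms idea from the last bullet can be sketched as a small helper (my own illustration; `add_interactions` is not a library function) that appends all pairwise feature products as new columns:

```python
import numpy as np

def add_interactions(X):
    """Append all pairwise feature products as new columns."""
    n, d = X.shape
    products = [X[:, i] * X[:, j] for i in range(d) for j in range(i + 1, d)]
    if not products:
        return X
    return np.hstack([X, np.stack(products, axis=1)])

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(add_interactions(X))
# columns: x1, x2, x3, x1*x2, x1*x3, x2*x3
```

Richer features reduce bias by giving the model information it couldn’t otherwise express, at the cost of more parameters (and potentially more variance).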
  • Boosting: Trains models sequentially, with each new model focusing on correcting the errors made by the previous ones.
    • Examples include AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM.
    • Each new model tries to fix the errors (biases) of the previous one.
    • While primarily aimed at reducing bias by combining weak learners, boosting can increase variance, especially if the base learners are too deep, too many rounds are used, or it keeps fitting harder and harder examples.
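A minimal squared-loss boosting sketch (my own toy implementation with regression stumps, not any particular library’s algorithm) shows the core loop: each round fits a weak learner to the current residuals, which drives the training error (the bias) down round by round.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, residual):
    """Least-squares regression stump: best single-threshold split."""
    best = None
    for t in x:
        left, right = residual[x <= t], residual[x > t]
        if len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        err = np.mean((residual - pred) ** 2)
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return t, lo, hi

def stump_predict(stump, x):
    t, lo, hi = stump
    return np.where(x <= t, lo, hi)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Squared-loss boosting: each round fits a stump to the residuals."""
    pred = np.zeros_like(y)
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)
        pred = pred + lr * stump_predict(stump, x)
        ensemble.append(stump)
    return ensemble, pred

x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x)
_, train_pred = gradient_boost(x, y)
print("initial MSE:", np.mean(y ** 2))
print("boosted train MSE:", np.mean((y - train_pred) ** 2))
```

With squared loss, each round provably reduces the training error (for learning rates in (0, 2)), which is why boosting attacks bias first; pushing `n_rounds` too high is where the variance risk comes in.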

Techniques to Manage Variance

  • Feature Selection / Dimensionality Reduction:
    • Overfitting can occur if the model uses too many features, especially irrelevant or noisy ones.
    • Feature Selection: Identify and remove features that have little predictive power or are redundant. Techniques like using L1 regularization or statistical tests can aid selection.
    • Dimensionality Reduction: Algorithms like PCA that project the data onto a lower-dimensional space while preserving most of the variance.
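PCA can be written in a few lines via the SVD of the centred data matrix (a self-contained sketch; `pca_reduce` is my own helper name):

```python
import numpy as np

def pca_reduce(X, k):
    """Project centred data onto its top-k principal components (via SVD)."""
    X_centred = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]
    return X_centred @ components.T, components

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z, components = pca_reduce(X, 3)
print(Z.shape)  # (200, 3)
```

Fitting downstream models on `Z` instead of `X` shrinks the effective parameter count, which is the variance-reduction mechanism described above.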
  • Regularization: L1/L2 penalties, weight decay, dropout, attention dropout
    • More about it in Supervised Machine Learning Course and Deep Learning Course
    • In linear regression, $\theta_0 + \theta_1 x + \theta_2 x^2$ may be a good (even perfect) model, but with $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ we need to shrink $\theta_3$ and $\theta_4$ (penalize them), i.e., regularize.
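That polynomial example can be run directly with closed-form ridge regression (a sketch under my own assumptions: quadratic ground truth, degree-4 features, an arbitrary $\lambda = 10$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 30)
y = 1.0 + 2.0 * x + 3.0 * x**2 + 0.1 * rng.normal(size=30)  # truth is quadratic

# Degree-4 polynomial design matrix: columns 1, x, x^2, x^3, x^4
F = np.vander(x, 5, increasing=True)

def ridge_fit(F, y, lam):
    """Closed-form ridge; the intercept (theta_0) is left unpenalized."""
    penalty = lam * np.eye(F.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(F.T @ F + penalty, F.T @ y)

theta_ols = ridge_fit(F, y, 0.0)
theta_reg = ridge_fit(F, y, 10.0)
print("no penalty:  ", np.round(theta_ols, 2))
print("lambda = 10: ", np.round(theta_reg, 2))
```

The penalized fit has a strictly smaller coefficient norm than the unpenalized one; in particular the superfluous $\theta_3$ and $\theta_4$ get pulled toward zero.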
  • Early Stopping: Stop training before the model memorizes noise
    • This was standard practice before double descent was discovered (see the epoch-wise double descent caveat above).
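The usual patience-based rule can be sketched as follows (my own minimal version; real frameworks add checkpointing and minimum-delta options):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping the scan once
    `patience` epochs pass without improvement."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.8, 0.9, 1.0]))  # 2
```

With epoch-wise double descent, a small `patience` can lock you into the first minimum, which is exactly the caveat raised earlier in these notes.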
  • Increase Training Data
    • Often the most effective way to combat overfitting.
    • More data provides a clearer picture of the underlying patterns and makes it harder for the model to fit random noise specific to a small sample.
  • Data Augmentation: If acquiring more real data is difficult, artificially expand the training set.
    • E.g., rotating/cropping images, adding slight noise, paraphrasing text
  • Reduce Model Complexity:
    • For neural networks, reduce the number of layers or neurons. For decision trees, prune the tree (limit depth or number of leaves). For polynomial regression, reduce the degree. Use smaller kernels in SVMs.
  • Ensembling (Averaging): Combine outputs from multiple trained models (e.g., random forests, deep ensembles), often significantly reducing variance
    • Bagging (Bootstrap Aggregating): Trains multiple instances of a base learner (often complex ones like decision trees) on different bootstrap samples (random samples with replacement) of the training data and averages their predictions. Random Forests are a prime example. Bagging primarily reduces variance.
    • Stacking (Stacked Generalization): Trains multiple different types of base models and uses another model (a meta-learner) to learn how to best combine their predictions. Aims to leverage the diverse strengths of different algorithms. It can help with both bias and variance, depending on the meta-learner.
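Bagging’s variance reduction can be seen on a toy problem (my own sketch, using 1-NN regression as a deliberately high-variance base learner): averaging predictions over bootstrap resamples visibly smooths the fitted curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_1nn(X_train, y_train, x_query):
    """1-NN regression: a low-bias, high-variance base learner."""
    dists = np.abs(x_query[:, None] - X_train[None, :])
    return y_train[np.argmin(dists, axis=1)]

def bagged_predict(X_train, y_train, x_query, n_models=50):
    """Bagging: average 1-NN predictions over bootstrap resamples."""
    n = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        preds.append(predict_1nn(X_train[idx], y_train[idx], x_query))
    return np.mean(preds, axis=0)

X = rng.uniform(-1, 1, 40)
y = np.sin(3 * X) + 0.3 * rng.normal(size=40)
x_grid = np.linspace(-1, 1, 200)

single = predict_1nn(X, y, x_grid)
bagged = bagged_predict(X, y, x_grid)

# Total variation of the prediction curve: lower = smoother, less noise-chasing
tv = lambda p: np.sum(np.abs(np.diff(p)))
print("single TV:", tv(single), " bagged TV:", tv(bagged))
```

The averaged curve jumps around far less than any single memorizing fit, which is the variance-reduction mechanism behind Random Forests.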
⚖️
These techniques are often interconnected and their effects are not always isolated to either bias or variance.
  • Regularization, while primarily targeting high variance, inherently introduces some bias by constraining the model; finding the right regularization strength is key. Conversely, reducing regularization to combat high bias can increase variance.
  • Feature selection reduces model complexity and thus variance, but removing features that hold valuable information, even if subtle, can increase bias.

Other Losses - Decomposition of Bias and Variance

  • The neat algebraic decomposition of MSE into bias and variance does not carry over directly to other loss functions, but bias, variance, and their trade-off still absolutely apply.
  • Other loss functions are defined differently, so the mathematical steps used to decompose the expected error simply don’t yield the same clean, additive $\mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$ structure.
  • Because in ML we care about systematic error (high bias) versus sensitivity to the data (high variance) regardless of the decomposition, the trade-off still applies.
  • Regardless of the loss definition, we detect and diagnose bias and variance with the same indirect tools described above: learning curves, resubstitution vs. test error, and cross-validation.