Bias and Variance
Supervised machine learning endeavors to learn a mapping function, denoted as f, that accurately maps input data X to an output variable y. The fundamental goal is not merely to perform well on the data used for training the model, but to generalize effectively, making accurate predictions on new, previously unseen data. The true measure of a model's success lies in its performance on this unseen data, often quantified by its prediction error.
Prediction errors can be decomposed into two main subcomponents of interest, error from bias and error from variance, in addition to a third component: the irreducible error.
Bias
Bias: It’s the error introduced by approximating a real-world problem (which may be extremely complex) by a much simpler model.
High bias: The model makes strong assumptions and fails to capture the data’s patterns well — this is called underfitting.
Example (High Bias): Using a linear model to fit a clearly non-linear relationship.
- High training error + High validation error → High Bias
(Model is too simple, underfitting)
Variance
Variance: It’s how much the model’s predictions change when trained on different data.
High variance: The model captures too much noise from the training data — this is called overfitting.
Example (High Variance): A deep decision tree that performs perfectly on training data but fails on test data.
- Low training error + High validation error → High Variance
(model is too complex, overfitting)
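A minimal sketch of these two failure modes (assuming numpy and scikit-learn are installed; the dataset and numbers are synthetic and purely illustrative): an underfitting linear model shows high error on both sets, while an unconstrained deep tree shows near-zero training error and a much higher validation error.

```python
# Sketch: diagnosing high bias vs. high variance from train/validation error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=400)   # nonlinear truth + noise
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for name, model in [("linear (high bias)", LinearRegression()),
                    ("deep tree (high variance)", DecisionTreeRegressor(max_depth=None))]:
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    va = mean_squared_error(y_va, model.predict(X_va))
    print(f"{name}: train MSE={tr:.3f}, validation MSE={va:.3f}")
# Typical pattern: linear -> both errors high; deep tree -> train ~0, validation much higher.
```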
Irreducible Error (ε)
- In addition to the model's errors from bias and variance, there exists a third component known as the irreducible error.
- It’s an error that simply cannot be reduced by any model.
- Analogy (Archer): Even the best archer in the world can NOT predict a sudden, unpredictable gust of wind that occurs just as the arrow leaves the bow. This unpredictable element is like an irreducible error.
- In supervised learning, we often assume the real world is messy.
Sources of Irreducible Error inside Data
IMPORTANT: Irreducible error stems not from the model itself, but from inherent properties of the data.
- Measurement Errors: The tools and methods are often imperfect. Sensors might have limited precision, readings can fluctuate, or human error can creep in —> inaccuracies that deviate from the true underlying values.
- Inherent Randomness (Noise): Many real-world phenomena possess an intrinsic element of stochasticity or randomness.
- For example, predicting the exact outcome of a coin flip or the precise movement of a stock price involves inherent unpredictability that no model can fully capture.
- Unobserved Features: There may be factors or features that influence the target variable but are not measured or included in the dataset used for modeling.
- These hidden features can be the source of the noise in the relationship. We might have oversimplified the data model.
Improving the learning algorithm, tuning hyperparameters, or adding more data (of the same type) cannot eliminate noise originating from the fundamental nature of the problem or the data collection process itself.
It is irreducible because no model, no matter how smart, can predict randomness.
Formula
- If we have a ground truth y and a true underlying function f(x) —> then y = f(x) + ε
- We wish to create f̂(x), and even if it matches f(x) exactly — we still have the noise ε.
- The output of f̂(x) is ŷ = f̂(x).
- ε is independent of the new input x and of the training data D.
- Mathematically: E[ε] = 0 (the noise has zero mean).
- Mathematically: Var(ε) = E[ε²] = σ² (the noise has a fixed variance).
- That’s because Var(ε) = E[ε²] − (E[ε])², and E[ε] = 0.
🎯 Why Do We Assume the Noise 𝜖 Has Zero Mean ?
- By definition, we assume it’s pure randomness. —> If noise had a non-zero mean, it would mean there’s a systematic shift — a pattern we could model!
- Another intuition: Noise = unpredictable deviations up or down, equally likely.
- Imagine flipping a coin —> The noise is fair (half heads, half tails) —> Over many flips, the noise should cancel out → zero mean. Otherwise, it's not random noise — it's a systematic bias.
- In standard ML and regression, zero-mean noise is the default because we want to cleanly separate randomness from predictability.
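A tiny numpy sketch of this intuition (numbers are illustrative): zero-mean noise averages out over many samples, whereas "noise" with a non-zero mean is a systematic shift that a model could simply learn and absorb into its prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
zero_mean_noise = rng.normal(loc=0.0, scale=1.0, size=100_000)
shifted_noise   = rng.normal(loc=0.5, scale=1.0, size=100_000)  # "noise" with a pattern

print(zero_mean_noise.mean())  # ~0.0: deviations up and down cancel out
print(shifted_noise.mean())    # ~0.5: a systematic offset, i.e. something a model could model
```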
The Bias-Variance Tradeoff
- The bias-variance tradeoff is one of the most fundamental concepts in supervised machine learning. It describes the inherent inverse relationship between the bias and the variance of a learning algorithm.
- Generally, actions taken to decrease a model's bias tend to result in an increase in its variance. Conversely, efforts to reduce a model's variance often lead to an increase in its bias.
- The goal is not to eliminate either bias or variance completely, but rather to find the "sweet spot”.
Target Shooting Analogy
- Your goal is to shoot arrows at the center of a target (the bullseye = the true function f(x)).
| Concept | What it Means | Archer Analogy |
|---|---|---|
| Bias | How far your average aim is from the true center. | If you are systematically aiming off-center (e.g., all your arrows land near the top-right of the target), you have high bias. |
| Variance | How much your shots spread out around your average aim. | If your arrows are scattered widely all over, you have high variance. If your arrows are tight together, you have low variance. |
| Irreducible Error | Random unavoidable noise in the system. | Even if you're perfect, some slight wind or hand shake could cause slight randomness. |
| Bias | Variance | What Happens | Description |
|---|---|---|---|
| Low Bias | Low Variance | Arrows tightly around bullseye | You're accurate and consistent — ideal model! |
| High Bias | Low Variance | Arrows tightly grouped but far from center | You're consistent but wrong — like a bad model that's underfitting |
| Low Bias | High Variance | Arrows centered on bullseye but spread out | You aim right, but you're inconsistent — like a model overfitting different data |
| High Bias | High Variance | Arrows scattered and off-center | You're both wrong and inconsistent — worst case |

Curve Fitting Analogy

- Even though we only discuss Bias-Variance from a regression perspective, keep in mind that the practical implications of the bias-variance tradeoff are applicable to all supervised learning contexts.
Test Analogy
- A nervous student (high variance) panicking at every test.
- A confident-but-wrong student (high bias) repeating the same mistakes.
Algorithm Tuning
Many algorithms have parameters that directly control complexity and thus navigate the tradeoff.
- K-Nearest Neighbors (k-NN): A small k leads to low bias/high variance; increasing k increases bias/decreases variance.
- Support Vector Machines (SVM): The regularization parameter C controls the tradeoff. High C leads to low bias/high variance; low C leads to high bias/low variance.
KNN Example
- KNN is explained here:
- Remember: high bias means the model pays very little attention to the training data and oversimplifies.
- If k = 1 and we have a test point —> the model will look at the single closest training point and copy its class, instead of looking at the k closest points and voting among them.
We call k = 1 overfitting (Low Bias, High Variance) because of the following 4 reasons:
- Memorizes noise: If a noisy label is close to a test point, it will copy it exactly.
- Highly sensitive to tiny changes: Adding or removing one point can change the prediction.
- Complex decision boundaries: If you color the areas based on the classifications, you will find the boundary flips often between classes — jagged and unstable.
- It achieves 0 training error: When we classify a training point, its own nearest neighbor is itself, so it simply picks its own label; with a larger k it would consider itself together with other points.
- If Large K (e.g., k = N, the whole training set):
- Even if a minority class dominates locally, the large majority from far away may override it.
- Model is stable, but can miss patterns (underfit) (High Bias, Low Variance); stable because with k = N the majority vote always includes all training points, so it almost always predicts the same (majority) class.
- If Intermediate K (a moderate value):
- Good balance between locality and smoothing, ✅ Moderate bias, ✅ moderate variance
- Sweet spot of generalization
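A minimal sketch of this k sweep (assuming scikit-learn; dataset and k values are illustrative): small k memorizes the training set, very large k collapses to the majority class, and a moderate k balances the two.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for k in [1, 15, len(X_tr)]:          # overfit, balanced, underfit
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>3}: train acc={knn.score(X_tr, y_tr):.2f}, "
          f"validation acc={knn.score(X_va, y_va):.2f}")
# k=1: train accuracy 1.00 (memorizes), lower validation accuracy -> high variance
# k=len(X_tr): nearly constant prediction                          -> high bias
```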
Bias Detailed
In Statistics
- Remember in statistics: A parameter is a number describing a whole population (e.g., population mean), while a statistic is a number describing a sample (e.g., sample mean).
- The statistics (sample mean, sample standard deviation, etc.) are estimators of the parameters.
- We want to ensure they accurately represent the population, so we find and derive the Unbiased Estimators.
- Any statistic θ̂ is an Unbiased Estimator for a parameter θ if E[θ̂] = θ —> which means the expected value of the estimator is exactly the population parameter (a small simulation follows below).
- In a nutshell: we want to check that if we have multiple sample datasets (size = n) —> the average (expected value) of their statistic will be equal to our population parameter.

- More on deriving Unbiased Estimators at:
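A short numpy simulation of the definition E[θ̂] = θ (population values and sample sizes are made up for illustration): averaging the statistic over many samples recovers the population parameter; note how the variance estimator that divides by n is biased, while the n − 1 version is not.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 5.0, 2.0                      # population parameters
n, trials = 10, 200_000

samples = rng.normal(mu, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)
var_biased   = samples.var(axis=1, ddof=0)   # divide by n
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1

print(sample_means.mean())    # ~5.0 -> sample mean is unbiased for mu
print(var_biased.mean())      # ~3.6 -> underestimates sigma^2 = 4 (biased)
print(var_unbiased.mean())    # ~4.0 -> unbiased for sigma^2
```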
In ML (Supervised Models) (Theoretical Definition)
We still have the same Bias definition: it’s the difference between the average prediction of a model and the true value.
- Same setup as we studied in irreducible error: our goal is to approximate the true function f(x) with our own model f̂(x), because f is usually unknown.
- Remember that the true value is y = f(x) + ε.
- We use a specific training dataset D to train a model, which results in a learned function f̂_D(x) — often just written as f̂(x).
- Our learned function f̂_D depends entirely on the specific random training dataset D.
- If we used another dataset D′, we’d get a slightly different f̂.
- Now, let's focus on a single new specific input point x. We test it using our model f̂_D, which is trained on a specific dataset D.
- IMPORTANT CRUCIAL: Since the model / estimator depends on the random dataset D, the prediction f̂_D(x) at this specific point x is also a random variable (because we could retrain the model on many different possible datasets D).
- Imagine we computed f̂_D(x) after training different versions of the model on all possible training datasets D of the same size.
- We can call the average of these predictions E_D[f̂_D(x)], or simply E[f̂(x)].
- Bias here measures: on average, how wrong is your model at a point x? Bias(x) = E[f̂(x)] − f(x).
- Bias represents a systematic error inherent in the model. The model's tendency to consistently miss the true value, regardless of the particular training data used.
- This systematic deviation arises from the model's inherent limitations or the assumptions it makes about the data.
When we talk about bias and variance formally in machine learning, we imagine (theoretically) that: There is a huge population of possible training datasets you could sample from the real world.
IMPORTANT: Theoretical bias and variance exist for mathematical understanding, but practical ML uses validation sets, cross-validation, regularization, etc., to control bias/variance WITHOUT directly computing them.
📖 Why can't we directly compute the expectation? Why is the bias calculation only theoretical in ML?
It is important to distinguish between the theoretical definition of bias and variance and how it is diagnosed in practice. The formal definition involves calculating the expected value over an infinite number of hypothetical training sets drawn from the true data distribution. This theoretical quantity measures the inherent instability of the learning algorithm given the data distribution and sample size.
- We'd need to retrain the model on all possible datasets — an infinite process.
- We'd need access to the true function f(x) — impossible in real life.
- We'd need to sample infinite noise patterns — again impossible.
Solution: variance and bias are diagnosed indirectly by observing the model's performance.
We discuss this here:
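In practice we diagnose indirectly, but on synthetic data where the true f(x) is known, the expectation over datasets can at least be approximated by brute force. A sketch (numpy only; the function, noise level, and model are arbitrary choices for illustration) of estimating bias and variance at a single test point x₀ by retraining on many sampled datasets:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x)               # true function (known only because the data is synthetic)
sigma = 0.3                                # noise standard deviation
x0 = 1.0                                   # the single test point of interest
degree, n, n_datasets = 3, 30, 2000

preds = []
for _ in range(n_datasets):                # approximates "all possible training sets"
    x = rng.uniform(-2, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, degree)      # train a degree-3 polynomial on this dataset
    preds.append(np.polyval(coeffs, x0))
preds = np.array(preds)

bias = preds.mean() - f(x0)                # E[f_hat(x0)] - f(x0)
variance = preds.var()                     # E[(f_hat(x0) - E[f_hat(x0)])^2]
print(f"bias={bias:.4f}, bias^2={bias**2:.4f}, variance={variance:.4f}")
```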
⚠️ Important Note: Unbiased ≠ Best
- An estimator (statistic) θ̂ is unbiased for a parameter θ if E[θ̂] = θ.
- An unbiased estimator might still be bad if its variance is huge — that is, the estimates are all over the place.
- Key to Understand: like a dart thrower whose arrows are wildly scattered (High Variance) but on average centered (Unbiased).
- He might miss the bullseye each time but the average landing spot is exactly at the center.
- We will use MSE to evaluate both bias and variance.
❗ Unbiased does not mean optimal. ✅ Optimal = lowest MSE, even if biased.
Variance Detailed
- This is just the usual notion of spread (a standard deviation squared becomes a variance), written for the model's predictions —> Var[f̂(x)] = E_D[(f̂_D(x) − E[f̂(x)])²]
- Formally: The variance of a model's prediction at point x is the expected squared difference between the prediction made by a model trained on a specific dataset D and the average prediction over all possible datasets D.
- Same multiple training dataset definitions as we did in Bias
- Conceptually: High variance is synonymous with high model sensitivity to the training data. It is sensitivity to fluctuations, noise within the training data ; small changes in the training set can lead to significantly different learned models and predictions.
- An overfit model fails to generalize because it has learned patterns that do not exist in the broader data distribution, mistaking noise for signal.
- A significant gap between low training error and high validation error is interpreted as a strong indicator of high variance.

Mean Squared Error (MSE)
- MSE is a criterion that tries to take into account both concerns: bias and variance.
- MSE is a way to measure the goodness of the estimator.
- MSE is the average of squared errors.
Scenario: Choosing Between Two Estimators 🎭
- Imagine we have two models, which one would you prefer? Isn't "unbiased" always better?
| Estimator | Bias | Variance |
|---|---|---|
| Estimator A | ✅ Small bias | ✅ Small variance |
| Estimator B | ✅ Zero bias (Unbiased) | ❌ Very high variance |

- Estimator A: Slightly biased but very stable.
- Estimator B: Unbiased but very unstable.
- Actually we would prefer Estimator A! Because small bias + low variance → often better than zero bias + huge variance.
- To make it easier, we actually need to minimize MSE!
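A small numeric sketch of this scenario (the bias and variance numbers are made up for illustration): once we compare MSE = Bias² + Variance, the slightly biased but stable estimator wins.

```python
# Estimator A: bias = 0.5, variance = 1    -> MSE = 0.25 + 1   = 1.25
# Estimator B: bias = 0,   variance = 100  -> MSE = 0    + 100 = 100
bias_a, var_a = 0.5, 1.0
bias_b, var_b = 0.0, 100.0
print("MSE(A) =", bias_a**2 + var_a)   # 1.25  <- preferred
print("MSE(B) =", bias_b**2 + var_b)   # 100.0
```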
Types of MSE
IMPORTANT: MSE is the Expected Squared Error — but Expected over what exactly?
In truth, there are several distinct flavors of theoretical MSE. Yet, in much of the machine learning literature, authors often refer to "MSE" without clearly specifying which version they mean.
- First Type of MSE: Fixed Single Test Input (Training Randomness Only) → We evaluate the performance of a model at a specific test point x, by imagining many possible training datasets.
- Here, the randomness comes only from the training dataset.
- We ask: "If we trained on different datasets, how would the model behave at this exact ?"
- Second Type of MSE: Fixed Single Trained Model (Input Randomness Only) → We evaluate the performance of a single trained model, by measuring its error across different input points drawn from the data distribution. 🔥 This version is often used in practical machine learning, where we assume the model is fixed and test performance varies with the input data.
- Third Type of MSE: Full Expectation (Training and Input Randomness) → We evaluate a model’s performance over both sources of randomness
- This involves a double expectation: first over training datasets, then over input points.
The good news is that, regardless of which MSE flavor you are faced with, they all admit a decomposition into bias-squared, variance, and noise.
Derivation (Using Training Dataset Randomness)
- We have y = f(x) + ε, so the expected prediction error at x is MSE(x) = E[(y − f̂(x))²] = E[(f(x) + ε − f̂(x))²]
- Rearrange the terms inside the parenthesis: E[((f(x) − f̂(x)) + ε)²]
- Expand the square (a+b)² = a² + 2ab + b²: E[(f(x) − f̂(x))² + 2ε(f(x) − f̂(x)) + ε²]
- Use the linearity of expectation: E[(f(x) − f̂(x))²] + 2E[ε(f(x) − f̂(x))] + E[ε²]
- Since E[ε] = 0 and Var(ε) = σ², we have E[ε²] = σ².
- For the middle term: the noise ε of the point (x, y) is independent of the training data D and of the learned function f̂, so E[ε(f(x) − f̂(x))] = E[ε]·E[f(x) − f̂(x)]. Since E[ε] is zero, the whole term is ZERO.
- We are still left with the piece E[(f(x) − f̂(x))²].
- We know (a−b)² = a² − 2ab + b² —> E[f(x)² − 2f(x)f̂(x) + f̂(x)²]
- Use the linearity of expectation: E[f(x)²] − 2E[f(x)f̂(x)] + E[f̂(x)²]
- We know that f(x) is a fixed value (the true function at x), so E[f(x)²] = f(x)² and E[f(x)f̂(x)] = f(x)·E[f̂(x)] —> f(x)² − 2f(x)E[f̂(x)] + E[f̂(x)²]
- Add and subtract (E[f̂(x)])² —> (f(x)² − 2f(x)E[f̂(x)] + (E[f̂(x)])²) + (E[f̂(x)²] − (E[f̂(x)])²)
- Now the trick is that we can create Bias² and Variance from this expression:
- f(x)² − 2f(x)E[f̂(x)] + (E[f̂(x)])² is just (f(x) − E[f̂(x)])², which is the Bias².
- We are left with E[f̂(x)²] − (E[f̂(x)])², which is the definition of the Variance: E[(f̂(x) − E[f̂(x)])²].
- All in all, combining the noise term with these two pieces: MSE(x) = E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
Why is the irreducible error the lower bound on the expected prediction error?
- Non-Negative Components (MSE):
- Bias² is always ≥ 0 because it's a squared value.
- Variance is always (≥0) by definition.
- Irreducible error has a variance σ² which is also always ≥ 0. In most real-world problems, there's some noise, so σ² > 0.
- The Lower Limit: even if Bias and Variance are both zero, the expected error (MSE) would still satisfy MSE(x) = σ², so in general MSE(x) ≥ σ².
No model, no matter how sophisticated or perfectly trained, can achieve an expected squared error lower than (at point x, averaged over datasets and noise).
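Extending the earlier single-point simulation, here is a sketch (numpy only; the function, noise level, and model are illustrative assumptions) that checks the decomposition numerically: the Monte Carlo MSE at x₀ matches Bias² + Variance + σ² and cannot drop below σ².

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)
sigma, x0 = 0.3, 1.0
degree, n, trials = 3, 30, 20000

preds, sq_errors = [], []
for _ in range(trials):
    x = rng.uniform(-2, 2, n)
    y = f(x) + rng.normal(0, sigma, n)           # a fresh training set
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x0)
    y0 = f(x0) + rng.normal(0, sigma)            # a fresh noisy observation at x0
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
mse = np.mean(sq_errors)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print(f"MSE                  ~ {mse:.4f}")
print(f"bias^2 + var + sigma^2 ~ {bias2 + var + sigma**2:.4f}")  # should be close to the MSE
print(f"sigma^2 (lower bound)  = {sigma**2:.4f}")
```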
Modern Machine Learning Twist 🧠 (Double Descent Phenomenon) (The end of the bias-variance trade-off?)
- Very large models (millions or billions of parameters) can still generalize well if trained carefully.
- This goes against the "classical" view where too much complexity must cause overfitting.
- The term Double Descent was coined in Dec 2018 in Reconciling Modern ML Practice and the Bias–Variance Trade-off; then in Dec 2019 OpenAI published a paper and article, Deep Double Descent, that shows the effect in CNNs, ResNets, transformers, and epoch-wise training.
- Since then, dozens of papers (e.g. “On the Role of Optimization in Double Descent”, “Double Descent Demystified”) extend the phenomenon to linear regression, random forests, and kernel methods, and analyze the implicit regularization of SGD.
- Resources:
- The Second Descent reveals behavior under extreme overparameterization that is not typically analyzed classically: we observe strong test performance from very overfit, complex models.
- It’s counterintuitive because you would expect the test error to keep rising monotonically once the model starts overfitting.
[1] The Classical Regime (Under-parameterized: P < N) (U-Curve): Underfitting — Sweet Spot — Overfitting
- Classical Tradeoff: increasing model complexity (e.g., adding parameters) decreases bias (the model becomes more flexible and can approximate the true function better) but increases variance (the model becomes more sensitive to the specific training points, including noise).
- Classical Tradeoff leads to the classical U-shaped curve for test error vs model complexity.
- The Blue Line (Test Error):
- Underfitting Region: Initially, increasing complexity improves test performance (lower bias).
- Sweet Spot (Optimal Balance): There's a balance where test error is minimized.
- Overfitting Region: Keep increasing complexity and test error starts to climb. (This is more cumbersome than underfitting)
- The Red Line (Train Error):
- It keeps decreasing until reaching the interpolation threshold.
- Interpolation Threshold: The Point where training error first becomes exactly zero. (High Variance)
- Interpolation Threshold: Means the model is so wiggly that it has perfectly fit and memorized the entire training set, passing through each datapoint.
- Interpolation Threshold: Mainly deals with the training error being ZERO, in this case the Test Error can be at its peak.

[2] What Happens at the Interpolation Threshold? (P ≈ N)
- Interpolation literal meaning here is that our model has interpolated between every single training point; It’s drawing a curve that runs through every single training point.
- P = Number of Parameters, N = Number of Training Samples
At the interpolation threshold, there can often be a single choice of model that works, and there is no reason to believe that model will be good for production.
- At this threshold we find P = N, but why does that make the training error zero? Why do we have only one way to interpolate the data?
- Let’s use polynomial regression as an example:
- You have N = 5 training datapoints (xᵢ, yᵢ).
- You have only 1 input column / feature.
- To have P = N = 5, we need 5 coefficients in the model —> which means a polynomial function of degree 4.
- Our model is f̂(x) = w₀ + w₁x + w₂x² + w₃x³ + w₄x⁴.
- So it’s like we have 5 equations (how many samples) with 5 unknown coefficients (the wᵢ).

- So (in general, unless the matrix is badly degenerate) there is one unique solution (see the sketch after this list).
- Degenerate and Matrix Inverses are discussed here:
- This is why at the interpolation threshold we have only ONE SINGLE SOLUTION. That unique solution can be weird, wiggly, and overfit.
- In the next phase (Over-Parameterized Regime), we will have multiples of solutions to choose from.
- What if we have two features, how can we do Polynomial Regression up to a given degree?
- It’s so simple —>
- We expand to many polynomial terms (combined features), and adjust the coefficients over those expanded features.
- Terms are powers and cross-products of x₁ and x₂.
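A sketch of both points (numpy / scikit-learn; the specific points and degrees are illustrative): with N = 5 points and a degree-4 polynomial (P = 5 coefficients), the interpolating fit is unique and has exactly zero training error, and with two features PolynomialFeatures builds the powers and cross-products.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# N = 5 points, degree-4 polynomial -> 5 equations, 5 unknown coefficients: one exact solution.
rng = np.random.default_rng(5)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = rng.normal(size=5)
coeffs = np.polyfit(x, y, deg=4)                 # the unique interpolating polynomial
print(np.allclose(np.polyval(coeffs, x), y))     # True: training error is exactly zero

# Two features, polynomial expansion: powers and cross-products of x1 and x2.
X = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(X)
print(expanded)   # [1, x1, x2, x1^2, x1*x2, x2^2] = [1, 2, 3, 4, 6, 9]
```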
[3] The Modern Interpolating Regime (Over-parameterized: P > N)
- After the interpolation threshold, As the model complexity increases —> the curve does not continue to rise indefinitely, instead the second descent happens in the Test Error —> We get the Double Descent Curve!
- It’s the characteristic of many successful modern deep learning models.
- The Double Descent Phenomenon is not confined to a specific niche model or dataset.
- It has been shown to apply in linear regression, ridge regression, Neural Networks, Convolutional Neural Networks (CNNs), Residual Networks (ResNets), Transformers, Decision Trees and Ensemble Methods.
- It has been tested on various datasets, including standard benchmarks like CIFAR-10 and CIFAR-100 (often with added label noise to amplify the effect), ImageNet, MNIST, as well as synthetic datasets (e.g., Gaussian data for theoretical analysis) and other real-world regression and classification datasets.
The question is: why does it work when P > N?
- Empirically this is well established, but theoretically why it happens is still an active area of research 😮
- Existing explanations often rely on simplified models (like linear regression or random features).
A quick simple analogy (IMPORTANT): You have 10 points to fit and exactly 10 sticks (parameters); there is essentially only one way to arrange the sticks so they touch every point, and it can be very contorted. With far more sticks than points, many arrangements fit, and we can pick a smooth one.
Implicit Regularization and Optimization Bias
- This is the most central theme in explaining the second descent.
- After the interpolation threshold, every model onwards passes through each training data point.
- Now there are infinitely many models that can fit the training data perfectly.
- Some of these solutions are ugly, but some are beautiful:
- The only thing that changes is how the model connects the in-between points.
- As the models become more and more complex, these connections can become smoother, and the resulting prediction may fit your test data better.
- In general, training leans towards solutions with: minimum L2 norm (simpler weights) among the zero-residual solutions, flatter minima (a less sharp loss surface), and smoother functions (a sketch follows after this list).
1. At the interpolation threshold, there can often be a single choice of model that works.
2. In the limit of infinitely large models, there will be a vast number of interpolating models, and we can pick the best amongst them.
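A sketch of one common toy setup (numpy only; the random cosine features, sample sizes, and noise level are illustrative assumptions, not the method from any specific paper): linear regression on random nonlinear features, solved with the minimum-L2-norm solution via the pseudoinverse. Sweeping the number of features P past N typically produces a test-error peak near P ≈ N followed by a second descent; exact shapes depend on the noise and feature choice.

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(3 * x)
n_train, n_test, sigma = 30, 500, 0.2
x_tr = rng.uniform(-1, 1, n_train);  y_tr = f(x_tr) + rng.normal(0, sigma, n_train)
x_te = rng.uniform(-1, 1, n_test);   y_te = f(x_te)

def random_features(x, P, seed=0):
    r = np.random.default_rng(seed)          # same seed -> same features for train and test
    w, b = r.normal(size=P), r.uniform(0, 2 * np.pi, P)
    return np.cos(np.outer(x, w) + b)        # random Fourier-style features

for P in [5, 15, 30, 60, 300, 3000]:         # P ~ n_train is the interpolation threshold
    Phi_tr, Phi_te = random_features(x_tr, P), random_features(x_te, P)
    theta = np.linalg.pinv(Phi_tr) @ y_tr    # minimum-norm least-squares solution
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"P={P:>5}: test MSE = {test_mse:.3f}")
# Typical pattern: error falls, spikes near P ≈ 30 (= n_train), then falls again for very large P.
```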
Factors Influencing Double Descent
The precise shape, peak location, and height of the double descent curve are not fixed but are influenced by the following factors (the model, the data, the training process, the regularization, sparsity, and noise).
[1] Model Architecture and Size (P) Model-wise Double Descent
- This is the most frequently studied form —> varying the size of the model (often the network width) while keeping the dataset and training procedure fixed.
- Increasing P is a necessity to move us into the over-parameterized region.
- Increasing network width (Number of Neurons in Layer) often leads to the canonical double descent behavior.
- In contrast, increasing network depth (Number of Layers) beyond a certain point, while also increasing P, has been observed in some studies (e.g., with ResNets) to worsen test performance monotonically.
- Note: Architectural choices induce different implicit biases or optimization landscapes, not simply an increase in P.
[2] Dataset Size (N) Sample-wise Non-monotonicity
- Describes a state where more samples hurt the performance of the model.
- Increasing N shifts the double descent peak towards larger model sizes P.
- Interesting: If P is fixed, increasing N can push the model away from the second descent towards the interpolation threshold —> This highlights the need to co-evolve model size and dataset size.
[3] Training Duration (Epochs) Epoch-wise Double Descent
- There is a regime where training longer reverses overfitting.
- Continue training even after overfitting appears (the error peak); a second descent may potentially follow.
- Don’t do Early Stopping on Test / Validation dataset — It can prevent the double descent curve, making you stop near the first minimum.
- The existence of epoch-wise double descent complicates the standard practice of early stopping based solely on validation error increase

https://arxiv.org/pdf/1912.02292
[4] Regularization: Regularization-wise Double Descent
- Regularization = Any method that encourages the model to avoid overfitting and generalize better.
- Explicit Regularization: You add it manually (in loss or architecture)
- L2 weight decay (Ridge), Dropout, Early stopping, Data augmentation
- Implicit Regularization / Implicit Bias: It happens naturally from training dynamics
- The optimizer Choice (e.g., SGD, Adam, RMSprop), Initial weights, Batch Normalization, Residual Connections
- Explicit Regularization is a factor because it can flatten or remove the harmful test error peak at the interpolation threshold, Making the transition from under- to overparameterized smoother
- Implicit Regularization is a factor because this implicit bias guides the selection from the many interpolating solutions in the P>N regime, favoring smooth curves and low-norm weights.
[5] Sparsity:
- Sparsity means: many parameters (weights) are zero —> You "prune" (remove) neurons, weights, or filters from a network after training (or even during training).
- Intuition says: fewer parameters = lower capacity → less overfitting, more generalization.
- In Double Descent, sparsity is less commonly mentioned, but surprisingly it follows a different pattern: Worsen —> Improve —> Worsen
- When you gradually increase sparsity in a trained model:
| Stage | Behavior | Why |
|---|---|---|
| Low sparsity (small pruning) | Test performance worsens first | Maybe you're deleting useful redundancies, disrupting the network's fine-tuned balance. Small amounts of pruning can break important structures inside the network. |
| Moderate sparsity | Test performance improves | Now sparsity acts like strong regularization, suppressing overfitting and letting true patterns dominate. |
| Extreme sparsity | Test performance worsens again | Now you're underfitting — the model can't even represent the true signal anymore; it damages the ability to learn at all. |
[6] Data Noise Levels
- Label noise = some training examples have the wrong labels. (For example, an image of a "cat" is mistakenly labeled "dog.")
- The model doesn’t know which labels are wrong — it tries to fit everything.
- With noisy labels, to achieve perfect fit (zero training error), the model is forced to contort itself — bend unnaturally — to also fit the wrong labels.
- Even without artificial noise, real-world complex datasets (like CIFAR-10, ImageNet) naturally have label noise or label ambiguity. Thus, even clean-looking datasets can exhibit double descent — but adding noise makes the effect much more dramatic and easier to observe.
- Label Noise in training data amplifies the peak of the double descent curve, because the model is now worse at predicting test data and has just memorized wrong data.
Factors Influencing Double Descent (DD) Characteristics
| Factor | Typical Effect on DD Curve | Notes |
|---|---|---|
| Model Size (Width ↑) | Increasing the model complexity, especially width (Number of Nodes per Layer) moves along the x-axis; so we observe the full DD. | |
| Model Size (Depth ↑) | Same as Width, but may degrade performance beyond a point; therefore second descent not guaranteed | Behavior is architecture-dependent (e.g., ResNets often handle depth better) |
| Dataset Size (N ↑) | Shifts peak to higher P, because more P needed until interpolation; | It changes location of interpolation threshold (P≈N) and can temporarily increase error if P is fixed near critical regime |
| Training Epochs ↑ | Traverses epoch-wise DD curve; reverses overfitting in large models. | |
| Explicit Regularization ↑ (e.g., L2, Dropout) | Reduces/eliminates peak height; smooths curve | These methods exist to prevent overfitting in the first place, so the interpolation peak may not appear at all; instead of DD we get a smooth, more monotonic curve. |
| Implicit Optimization (Optimizer Choice) | Alters curve shape via different implicit biases | Different optimizers (e.g., SGD vs. Adam) favor different solution regions. Impacts generalization and DD visibility. |
| Sparsity ↑ | Produces non-monotonic “sparse DD”: error ↑ then ↓ then ↑ again | Initially hurts (damages structure), then helps (regularization), finally harms again (underfitting). |
| Label Noise ↑ | Amplifies the interpolation peak; worsens test error near threshold | Forces models near 𝑃≈𝑁 to fit incorrect labels, yielding unstable, poor solutions. Makes DD behavior more pronounced. |
🔥 How we indirectly detect (diagnose) Bias and Variance in practice
- Training vs Validation Learning Curves: Plotting the model's performance (e.g., loss or accuracy) vs. training time (epochs) or dataset size (a sketch follows below).
- High Bias:
- Both curves have a high error level. They are close together, but the error is unacceptably high. Increasing training time or data doesn't help if the model is too simple.
- High Variance:
- There's a significant gap between the two curves.
- One way to decrease variance would be to add more training data. However, if that data is re-allocated from a fixed pool, there is a price to pay: we will have less test data.
- Good Fit: Both converge at a low error level with only a small gap between them.
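A sketch of the learning-curve diagnostic (assuming scikit-learn; the classifier and dataset are illustrative): `learning_curve` retrains the model on growing subsets and reports train vs. validation scores, whose level and gap hint at bias and variance respectively.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:>4}: train={tr:.2f}, validation={va:.2f}  (gap={tr - va:.2f})")
# Large persistent gap -> high variance; both scores low and flat -> high bias.
```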

- Resubstitution Error: the error obtained from the Training Data.
- You train and test on the same data to get the best-case scenario; it's an optimistic measure.
- If Low Resubstitution Error + High Test Error —> Overfitting (High Variance)
- If High Resubstitution Error + High Test Error —> Underfitting (High Bias)
- Resubstitution Error mainly informs you about bias (if even the training error is high, the model is underfitting) —> It tells you little about how predictions will fluctuate on new data; variance is invisible here.
- Model Complexity:
- If you use a very simple model (linear regression for a complex problem, shallow neural network, low-degree polynomial) and see poor performance everywhere, high bias is a likely suspect.
- If you use a very complex model (very deep network, high-degree polynomial, decision tree with no depth limit) without sufficient regularization and see a large gap between training and validation performance, suspect high variance.
- Cross-Validation Results (a short sketch follows after this list):
- If performance is consistently poor across all folds of cross-validation, it points towards high bias.
- If performance varies drastically between different folds (very good on some, poor on others), it suggests sensitivity to training data subset, indicating high variance.
- More details about it here:
- ✅ Always evaluate across multiple seeds or folds.
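A short sketch of the cross-validation diagnostic (assuming scikit-learn; the model and dataset are illustrative): the mean score hints at bias, the spread across folds hints at variance.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(f"mean={scores.mean():.2f} (low mean -> high bias), "
      f"std={scores.std():.2f} (large spread across folds -> high variance)")
```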
🔥 Managing Bias and Variance: Techniques and Strategies
Techniques to Manage Bias
- Use More Complex Models: Switch to a More Powerful Model or Increase Complexity within the Model
- Decrease Regularization:
- Regularization methods (like L1, L2, dropout) are primarily used to combat overfitting by penalizing model complexity. If a model is underfitting, it might be because the regularization is too strong.
- It gives the model more freedom to fit the data. Removing regularization entirely might also be considered.
- Train Longer or Increase Training Data (Use with Caution)
- For iterative algorithms like neural networks trained with gradient descent, underfitting might occur if the training process is stopped too early before the model has converged.
- Learning curves are essential to look at.
- Feature Engineering / Embeddings: If the input features do not contain enough information to predict the target accurately, improve them:
- Replace static word2vec with context-aware BERT / RoBERTa / LLM embeddings.
- Train embeddings jointly with the task: don't freeze pre-trained embeddings; let them adapt to your task.
- Collect or derive additional relevant features that capture more aspects of the problem. Domain expertise is often crucial here.
- Create new features from existing ones, such as interaction terms (products of features)
- Boosting: Trains models sequentially, with each new model focusing on correcting the errors made by the previous ones.
- Examples include AdaBoost, Gradient Boosting Machines (GBM), XGBoost, LightGBM.
- Each new model tries to fix the errors (biases) of the previous one.
- While primarily aimed at reducing bias by combining weak learners, variance can increase, especially if base learners are too deep or too many rounds are used.
- Again, it might increase variance if it keeps fitting harder and harder examples.
Techniques to Manage Variance
- Feature Selection / Dimensionality Reduction:
- Overfitting can occur if the model uses too many features, especially irrelevant or noisy ones.
- Feature Selection: Identify and remove features that have little predictive power or are redundant. Techniques like using L1 regularization or statistical tests can aid selection.
- Dimensionality Reduction: Algorithms like PCA that project the data onto a lower-dimensional space while preserving most of the variance.
- Regularization: L1/L2 penalties, weight decay, dropout, attention dropout
- More about it in Supervised Machine Learning Course and Deep Learning Course
- In Linear Regression, a model with just the needed terms can be a good (even perfect) model, but if we add higher-order terms we will need to shrink their coefficients (penalize them), i.e., regularize.
- Early Stopping: Stop training before the model memorizes noise
- This was the case before discovering Double Descent.
- Increase Training Data
- Often the most effective way to combat overfitting.
- More data provides a clearer picture of the underlying patterns and makes it harder for the model to fit random noise specific to a small sample.
- Data Augmentation: If acquiring more real data is difficult, artificially expand the training set.
- E.g., rotating/cropping images, adding slight noise, paraphrasing text
- Reduce Model Complexity:
- For neural networks, reduce the number of layers or neurons. For decision trees, prune the tree (limit depth or number of leaves). For polynomial regression, reduce the degree. Use smaller kernels in SVMs.
- Ensembling (Averaging): Combine outputs from multiple trained models (e.g., random forests, deep ensembles), often significantly reducing variance (a sketch follows after this list).
- Learn more here:
- Bagging (Bootstrap Aggregating): Trains multiple instances of a base learner (often complex ones like decision trees) on different bootstrap samples (random samples with replacement) of the training data and averages their predictions. Random Forests are a prime example. Bagging primarily reduces variance.
- Stacking (Stacked Generalization): Trains multiple different types of base models and uses another model (a meta-learner) to learn how to best combine their predictions. Aims to leverage the diverse strengths of different algorithms. It can help with both bias and variance, depending on the meta-learner.
- Regularization, while primarily targeting high variance , inherently introduces some bias by constraining the model; finding the right regularization strength is key. Conversely, reducing regularization to combat high bias can increase variance.
- Feature selection reduces model complexity and thus variance , but removing features that hold valuable information, even if subtle, can increase bias.
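A sketch of the ensembling point above (assuming scikit-learn; the synthetic data and ensemble size are illustrative): averaging many deep trees trained on bootstrap samples keeps bias low while reducing variance, typically lowering validation error compared with a single deep tree.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.4, size=600)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

single = DecisionTreeRegressor().fit(X_tr, y_tr)                     # low bias, high variance
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          random_state=0).fit(X_tr, y_tr)            # averaging reduces variance
print("single tree  validation MSE:", mean_squared_error(y_va, single.predict(X_va)))
print("bagged trees validation MSE:", mean_squared_error(y_va, bagged.predict(X_va)))
```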
Other Losses - Decomposition of Bias and Variance
- While the neat algebraic decomposition of MSE into Bias² and Variance doesn't directly carry over to all other loss functions, bias, variance, and their trade-off still absolutely apply.
- The loss functions are defined differently, so the mathematical steps used to decompose the expected error simply don't result in the same clean, additive structure.
- Because in ML we focus on systematic error (high bias) and sensitivity to data (high variance) regardless of the decomposition, the tradeoff still applies.
- Regardless of the loss definition, we detect and diagnose Bias and Variance here:
