Bias and Variance
When building a machine learning model, we want it to learn the actual pattern in the data and make accurate predictions on unseen data. However, a model's errors come from two major sources:
- Bias
- Variance
Understanding these concepts is essential because they help explain underfitting, overfitting, and the bias-variance tradeoff.
What is Bias?
Bias is the error caused when a model is too simple to capture the underlying pattern in the data.
A high-bias model makes strong assumptions and fails to learn the actual relationship between input and output variables.
In simple words:
Bias measures how far the model's predictions are, on average, from the true pattern.
High bias leads to underfitting.
What is Variance?
Variance is the error caused when a model becomes too sensitive to the training data.
A high-variance model learns not only the actual pattern but also noise and small fluctuations present in the training dataset.
In simple words:
Variance measures how much the model's predictions change when it is trained on different datasets drawn from the same source.
High variance leads to overfitting.
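For squared-error loss, the two definitions can be made precise. The following is the standard decomposition, written under assumptions that are not stated in this article: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}_D$ denotes the model obtained by training on a randomly drawn dataset $D$.

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
```

The noise term cannot be reduced by any model. Bias and variance are the two parts that depend on the choice of model.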
Example Dataset
Consider the following dataset:
| Study Hours | Marks |
|---|---|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
| 6 | 60 |
The actual relationship is:
Marks = 10 × Study Hours
The data follows a clear increasing trend.
Case 1: High Bias (Underfitting)
Suppose a model predicts:
Marks = 35
for every student.
| Study Hours | Actual Marks | Predicted Marks |
|---|---|---|
| 1 | 10 | 35 |
| 2 | 20 | 35 |
| 3 | 30 | 35 |
| 4 | 40 | 35 |
| 5 | 50 | 35 |
| 6 | 60 | 35 |
Graphically, the actual marks rise steadily from 10 to 60 as study hours go from 1 to 6, while the model is a flat horizontal line at 35.
What happened?
The model completely ignores the increasing trend. It assumes that everyone gets 35 marks, so it is too simple to learn the pattern.
Training Error: High
Testing Error: High
This situation is called high bias, or underfitting.
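To put a number on "high" training error, here is a minimal sketch that scores the constant prediction of 35 against the dataset above. The use of mean squared error (MSE) as the error measure is an assumption of this sketch; the article does not name a specific metric.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])
marks = np.array([10, 20, 30, 40, 50, 60])

# High-bias model from Case 1: predict 35 for every student.
predicted = np.full(marks.shape, 35.0)

# Mean squared error on the training data (assumed metric).
mse_constant = np.mean((marks - predicted) ** 2)

# 35 is the average mark, so no other constant prediction has a lower MSE.
best_constant = marks.mean()

print(f"Training MSE of the constant model: {mse_constant:.2f}")  # 291.67
print(f"Mean of the marks: {best_constant:.1f}")                  # 35.0
```

Because 35 is already the best possible constant, the only way to lower this error is to let the model depend on study hours.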
Case 2: Balanced Model (Good Generalization)
Now suppose the model learns:
Marks = 10 × Study Hours
Predictions:
| Study Hours | Actual Marks | Predicted Marks |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 20 | 20 |
| 3 | 30 | 30 |
| 4 | 40 | 40 |
| 5 | 50 | 50 |
| 6 | 60 | 60 |
Graphically, every predicted point coincides with the actual point, and the points lie on a straight line rising from 10 marks at 1 hour to 60 marks at 6 hours.
The model correctly learns the relationship between study hours and marks.
Training Error: Low
Testing Error: Low
This is the ideal situation.
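One way such a model could be learned is an ordinary least-squares line fit. The sketch below uses `np.polyfit` for this; the choice of fitting method and the two "unseen" students are assumptions made for illustration, with the new students assumed to follow the same relationship as the table.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([10, 20, 30, 40, 50, 60], dtype=float)

# Fit marks = slope * hours + intercept by ordinary least squares.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(hours, marks, deg=1)

train_mse = np.mean((marks - (slope * hours + intercept)) ** 2)

# Hypothetical unseen students, assumed to follow Marks = 10 x Study Hours.
new_hours = np.array([2.5, 7.0])
new_marks = 10 * new_hours
test_mse = np.mean((new_marks - (slope * new_hours + intercept)) ** 2)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The fitted slope is 10 and the intercept is 0 up to floating-point rounding, so both errors are essentially zero on this noise-free data.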
Case 3: High Variance (Overfitting)
Now suppose the training data contains some noise.
| Study Hours | Marks |
|---|---|
| 1 | 10 |
| 2 | 18 |
| 3 | 33 |
| 4 | 38 |
| 5 | 55 |
| 6 | 59 |
An overfitted model tries to memorize every training point exactly.
Graphically, the overfitted model's curve passes exactly through every noisy training point (10, 18, 33, 38, 55, 59) instead of following the straight-line trend.
Instead of learning the general rule:
More Study Hours → Higher Marks
it learns each training pair exactly:
1 hour → 10
2 hours → 18
3 hours → 33
4 hours → 38
5 hours → 55
6 hours → 59
Training Error: Very Low (Almost Zero)
Testing Error: High, because new students will not follow these exact values.
This situation is called high variance, or overfitting.
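The article does not say which model does the memorizing. As one possible stand-in, the sketch below fits a degree-5 polynomial, which has enough freedom to pass through all six noisy points, and compares it with a straight line. The test students at half-hour values are hypothetical and are assumed to follow Marks = 10 × Study Hours.

```python
import numpy as np
from numpy.polynomial import Polynomial

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
noisy_marks = np.array([10, 18, 33, 38, 55, 59], dtype=float)

# Hypothetical unseen students, assumed to follow Marks = 10 x Study Hours.
test_hours = np.array([1.5, 2.5, 3.5, 4.5, 5.5])
test_marks = 10 * test_hours

def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((y - model(x)) ** 2))

# Degree 5 with 6 points: the curve can pass through every training point.
memorizer = Polynomial.fit(hours, noisy_marks, deg=5)
# Degree 1: a straight line that can only follow the overall trend.
line = Polynomial.fit(hours, noisy_marks, deg=1)

for name, model in [("degree 5", memorizer), ("degree 1", line)]:
    print(name,
          "train MSE:", round(mse(model, hours, noisy_marks), 3),
          "test MSE:", round(mse(model, test_hours, test_marks), 3))
```

The degree-5 curve reaches a training error of essentially zero but does worse than the straight line on the held-out points, which is the overfitting pattern described above.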
High Bias vs High Variance
| Aspect | High Bias | High Variance |
|---|---|---|
| Model Complexity | Too Simple | Too Complex |
| Learning Ability | Learns too little | Learns too much |
| Training Error | High | Very Low |
| Testing Error | High | High |
| Main Problem | Underfitting | Overfitting |
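The table can be read as a rough decision rule. The helper below is a hypothetical sketch, not a standard function: the name `diagnose` and the `acceptable_error` threshold are assumptions, and in practice the threshold depends on the problem.

```python
def diagnose(train_error: float, test_error: float, acceptable_error: float) -> str:
    """Rough reading of the comparison table.

    `acceptable_error` is a hypothetical, problem-specific threshold
    chosen by the user; it is not defined in the article.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (underfitting)"
    if test_error > acceptable_error:
        # Fits the training data well but fails on unseen data.
        return "high variance (overfitting)"
    return "balanced"


# Illustrative numbers only.
print(diagnose(train_error=290.0, test_error=300.0, acceptable_error=10.0))
print(diagnose(train_error=0.0, test_error=150.0, acceptable_error=10.0))
print(diagnose(train_error=1.0, test_error=2.0, acceptable_error=10.0))
```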
Bias-Variance Tradeoff
In machine learning, making a model more flexible usually reduces bias but increases variance, while making it simpler usually reduces variance but increases bias.
This balance is called the bias-variance tradeoff.
Too Simple Model → High Bias → Underfitting
Too Complex Model → High Variance → Overfitting
Balanced Model → Good Generalization
The goal is to achieve low bias and low variance at the same time, so that the model performs well on both training and unseen data.
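The tradeoff can be observed directly by simulation. The sketch below repeatedly draws noisy training sets from Marks = 10 × Study Hours, fits three models of increasing complexity, and estimates squared bias and variance of their predictions. The noise level, the number of trials, the evaluation grid, and the three polynomial degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
hours = np.arange(1, 7, dtype=float)
grid = np.linspace(1, 6, 11)   # points where predictions are compared
true_marks = 10 * grid
noise_std = 3.0                # assumed noise level
n_trials = 2000                # number of simulated training sets

degrees = {"constant": 0, "linear": 1, "degree 5": 5}
preds = {name: np.empty((n_trials, grid.size)) for name in degrees}

for t in range(n_trials):
    # A fresh noisy training set drawn from the same true relationship.
    marks = 10 * hours + rng.normal(0, noise_std, size=hours.size)
    for name, deg in degrees.items():
        preds[name][t] = Polynomial.fit(hours, marks, deg=deg)(grid)

results = {}
for name, p in preds.items():
    # Squared bias: gap between the average prediction and the truth.
    bias_sq = np.mean((p.mean(axis=0) - true_marks) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(p.var(axis=0))
    results[name] = (bias_sq, variance)
    print(f"{name:9s} bias^2 = {bias_sq:8.2f}   variance = {variance:6.2f}")
```

The constant model shows by far the largest squared bias and the smallest variance, the degree-5 model shows the largest variance, and the linear model keeps both small.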
Analogy
Imagine a teacher evaluating students.
High Bias Teacher
The teacher says:
Everyone gets 35 marks.
This is too simplistic and inaccurate.
High Variance Teacher
The teacher memorizes every student's previous score and cannot evaluate new students properly.
Good Teacher
The teacher understands the general relationship:
More study usually leads to better marks.
and can evaluate new students accurately.
This is how a good machine learning model behaves.
How to Reduce Bias
To reduce high bias:
- Use a more powerful model
- Add useful features
- Increase model complexity
- Improve feature engineering
- Reduce excessive regularization
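The last point can be illustrated with a small ridge-style fit on the study-hours data. This is a sketch under assumptions: the helper `ridge_line`, the L2 penalty on the slope, and the `alpha` values are all hypothetical choices made for this example.

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([10, 20, 30, 40, 50, 60], dtype=float)

def ridge_line(x, y, alpha):
    """Fit y = slope * x + intercept with an L2 penalty alpha * slope**2.

    The intercept is left unpenalized, which is a common convention.
    """
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc + alpha)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

for alpha in [1000.0, 10.0, 0.0]:   # illustrative penalty strengths
    slope, intercept = ridge_line(hours, marks, alpha)
    train_mse = np.mean((marks - (slope * hours + intercept)) ** 2)
    print(f"alpha={alpha:7.1f}  slope={slope:5.2f}  train MSE={train_mse:7.2f}")
```

With a very large penalty the slope is squeezed toward zero and the fit approaches the constant prediction of 35 marks. Lowering the penalty lets the slope return to 10 and the training error fall to zero.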
How to Reduce Variance
To reduce high variance:
- Use more training data
- Apply regularization
- Use cross-validation to detect overfitting and to choose model complexity
- Reduce model complexity
- Use ensemble methods such as Random Forest
- Prune decision trees
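As an example of using cross-validation to choose model complexity, the sketch below runs leave-one-out cross-validation on the noisy data from Case 3. The helper `loo_cv_mse` and the two candidate polynomial degrees are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
noisy_marks = np.array([10, 18, 33, 38, 55, 59], dtype=float)

def loo_cv_mse(x, y, deg):
    """Leave-one-out cross-validation.

    Hold out each point in turn, fit on the remaining points,
    and score the prediction for the held-out point.
    """
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        model = Polynomial.fit(x[keep], y[keep], deg=deg)
        errors.append((y[i] - model(x[i])) ** 2)
    return float(np.mean(errors))

# Degree 4 passes through all 5 training points of each fold,
# so it plays the role of the overly complex model here.
for deg in [1, 4]:
    score = loo_cv_mse(hours, noisy_marks, deg)
    print(f"degree {deg}: leave-one-out MSE = {score:.1f}")
```

The straight line gets a much lower held-out error than the degree-4 curve, so cross-validation points toward the simpler model even though the complex one fits each training fold perfectly.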
Important Points
- Bias is the error caused by oversimplified assumptions.
- High bias leads to underfitting.
- Variance is the error caused by excessive sensitivity to training data.
- High variance leads to overfitting.
- A good model balances bias and variance.
- Training error and testing error help identify bias and variance issues.
- The bias-variance tradeoff is one of the most important concepts in machine learning.
- The ultimate goal is good generalization on unseen data.