Bias and Variance - Machine Learning

When building a machine learning model, we want it to learn the actual pattern in the data and make accurate predictions on unseen data. However, a model's prediction errors come from two main sources:

Bias
Variance

Understanding these concepts is essential because they help explain underfitting, overfitting, and the bias-variance tradeoff.

What is Bias?

Bias is the error caused when a model is too simple to capture the underlying pattern in the data.

A high-bias model makes strong assumptions and fails to learn the actual relationship between input and output variables.

In simple words:

Bias measures how far the model's predictions are, on average, from the true pattern.

High bias leads to:

Underfitting
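As a rough illustration of this definition, bias can be estimated by simulation. The sketch below assumes a true relationship of marks = 10 × hours with Gaussian noise (standard deviation 3, an arbitrary choice) and a deliberately simple model that always predicts the mean of its training marks. All function names are hypothetical.

```python
import random
from statistics import mean

def true_marks(hours):
    # Assumed ground truth for illustration: marks = 10 * hours
    return 10 * hours

def sample_training_set(rng, hours=(1, 2, 3, 4, 5, 6), noise_sd=3.0):
    # Draw one noisy training set around the assumed true relationship
    return [(h, true_marks(h) + rng.gauss(0, noise_sd)) for h in hours]

def fit_constant(data):
    # A deliberately simple model: always predict the mean training mark
    avg = mean(y for _, y in data)
    return lambda hours: avg

rng = random.Random(0)
query_hours = 6

# Train the same simple model on many different training sets
predictions = [
    fit_constant(sample_training_set(rng))(query_hours) for _ in range(2000)
]

# Bias at the query point: average prediction minus the true value
bias = mean(predictions) - true_marks(query_hours)
print(f"average prediction: {mean(predictions):.1f}, bias: {bias:.1f}")
```

Under these assumptions the average prediction at 6 hours stays near 35 while the true value is 60, so the estimated bias is about -25 no matter how many training sets are drawn.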

What is Variance?

Variance is the error caused when a model becomes too sensitive to the training data.

A high-variance model learns not only the actual pattern but also noise and small fluctuations present in the training dataset.

In simple words:

Variance measures how much the model's predictions change when it is trained on different datasets.

High variance leads to:

Overfitting
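One way to see this sensitivity is to retrain a model on many noisy training sets and measure how much its prediction at a single point moves around. The sketch below is only an illustration. It assumes marks = 10 × hours plus Gaussian noise (standard deviation 3) and compares a straight-line fit with a polynomial that passes through every training point. The function names are hypothetical.

```python
import random
from statistics import mean, pvariance

HOURS = [1, 2, 3, 4, 5, 6]

def sample_marks(rng, noise_sd=3.0):
    # Assumed data-generating process: marks = 10 * hours + Gaussian noise
    return [10 * h + rng.gauss(0, noise_sd) for h in HOURS]

def fit_line(xs, ys):
    # Ordinary least squares for a straight line
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return lambda x: y_bar + slope * (x - x_bar)

def fit_interpolant(xs, ys):
    # Lagrange polynomial that passes through every training point exactly
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

rng = random.Random(0)
query = 5.5
line_preds, interp_preds = [], []
for _ in range(2000):
    marks = sample_marks(rng)
    line_preds.append(fit_line(HOURS, marks)(query))
    interp_preds.append(fit_interpolant(HOURS, marks)(query))

# Spread of the predictions across training sets = variance of the model
var_line = pvariance(line_preds)
var_interp = pvariance(interp_preds)
print(f"variance of line predictions:        {var_line:.1f}")
print(f"variance of interpolant predictions: {var_interp:.1f}")
```

The interpolating polynomial chases the noise in each training set, so its predictions at 5.5 hours spread out several times more than those of the straight line.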

Example Dataset

Consider the following dataset:

Study Hours	Marks
1	10
2	20
3	30
4	40
5	50
6	60

The actual relationship is:

Marks = 10 × Study Hours

The data follows a clear increasing trend.

Case 1: High Bias (Underfitting)

Suppose a model predicts:

Marks = 35

for every student.

Study Hours	Actual Marks	Predicted Marks
1	10	35
2	20	35
3	30	35
4	40	35
5	50	35
6	60	35

Graphically:

60 |                    *
50 |                *
40 |            *
35 |---------------------- Model
30 |        *
20 |    *
10 | *
   +-------------------------
     1  2  3  4  5  6

What happened?

The model completely ignores the increasing trend.

It assumes:

Everyone gets 35 marks

The model is too simple to learn the pattern.

Training Error:

High

Testing Error:

High

This situation is called:

High Bias
Underfitting
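The "High" training error can be made concrete with a short calculation on the table above. This is a minimal sketch; the choice of mean squared error and mean absolute error as metrics is an assumption, since the article does not name one.

```python
hours = [1, 2, 3, 4, 5, 6]
marks = [10, 20, 30, 40, 50, 60]

def constant_model(_hours):
    # The high-bias model from Case 1: predict 35 for every student
    return 35

# Error for each student: actual marks minus predicted marks
errors = [actual - constant_model(h) for h, actual in zip(hours, marks)]

mse = sum(e ** 2 for e in errors) / len(errors)   # mean squared error
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

print(errors)         # [-25, -15, -5, 5, 15, 25]
print(round(mse, 2))  # 291.67
print(mae)            # 15.0
```

The errors are negative for low study hours and positive for high study hours. This systematic pattern is typical of underfitting: the model misses the trend rather than making random mistakes.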

Case 2: Balanced Model (Good Generalization)

Now suppose the model learns:

Marks = 10 × Study Hours

Predictions:

Study Hours	Actual Marks	Predicted Marks
1	10	10
2	20	20
3	30	30
4	40	40
5	50	50
6	60	60

Graph:

60 |                    *
50 |                *
40 |            *
30 |        *
20 |    *
10 | *
   +-------------------------
     1  2  3  4  5  6

The model correctly learns the relationship between study hours and marks.

Training Error:

Low

Testing Error:

Low

This is the ideal situation.
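A model like this can be obtained with an ordinary least-squares line fit. The sketch below uses the closed-form formulas for a single feature and then predicts for a hypothetical unseen student who studies 7 hours (a value that is not in the table).

```python
from statistics import mean

hours = [1, 2, 3, 4, 5, 6]
marks = [10, 20, 30, 40, 50, 60]

# Closed-form least squares for one feature
x_bar, y_bar = mean(hours), mean(marks)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, marks))
s_xx = sum((x - x_bar) ** 2 for x in hours)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict(h):
    return intercept + slope * h

# Training error of the fitted line
training_mse = mean((y - predict(x)) ** 2 for x, y in zip(hours, marks))

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"training MSE={training_mse:.2f}")
print(f"prediction for 7 hours={predict(7):.1f}")
```

On this noise-free data the fit recovers a slope of 10 and an intercept of 0, so the training error is zero and the prediction for 7 hours follows the same rule (70 marks).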

Case 3: High Variance (Overfitting)

Now suppose the training data contains some noise.

Study Hours	Marks
1	10
2	18
3	33
4	38
5	55
6	59

An overfitted model tries to memorize every training point exactly.

Graph:

60 |                   *
55 |              *
50 |
45 |
40 |          *
35 |      *
30 |
25 |
20 |   *
15 |
10 | *
   +-------------------------
     1  2  3  4  5  6

Instead of learning:

More Study Hours → Higher Marks

it learns:

1 hour → 10
2 hours → 18
3 hours → 33
4 hours → 38
5 hours → 55
6 hours → 59

exactly.

Training Error:

Very Low (Almost Zero)

Testing Error:

High

because new students will not follow these exact values.

This situation is called:

High Variance
Overfitting
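The memorizing behaviour can be reproduced with a polynomial that is forced through all six noisy training points. The sketch below uses Lagrange interpolation as one possible stand-in for an overfitted model; the article does not say which model it has in mind. The new students at half-hour values are hypothetical and are assumed to follow the true rule Marks = 10 × Study Hours.

```python
train_hours = [1, 2, 3, 4, 5, 6]
train_marks = [10, 18, 33, 38, 55, 59]   # noisy training data from the table

def overfit_model(x):
    # Lagrange polynomial (degree 5) through every training point
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_hours, train_marks)):
        weight = 1.0
        for j, xj in enumerate(train_hours):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def mse(xs, ys):
    return sum((y - overfit_model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical unseen students who follow the true relationship
new_hours = [1.5, 2.5, 3.5, 4.5, 5.5]
new_marks = [10 * h for h in new_hours]

train_mse = mse(train_hours, train_marks)
test_mse = mse(new_hours, new_marks)

print(f"training MSE: {train_mse:.2f}")
print(f"testing MSE:  {test_mse:.2f}")
print(f"prediction for 5.5 hours: {overfit_model(5.5):.1f} (true value 55)")
```

The training error is zero because the curve hits every training point, while the error on the new students is far larger. A straight line fitted to the same noisy data would miss each training point slightly but stay close to the true rule in between.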

High Bias vs High Variance

Aspect	High Bias	High Variance
Model Complexity	Too Simple	Too Complex
Learning Ability	Learns too little	Learns too much
Training Error	High	Very Low
Testing Error	High	High
Main Problem	Underfitting	Overfitting
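The training and testing rows of this table suggest a simple rule of thumb for diagnosing a model. The helper below is a hypothetical sketch: the function name, the acceptable_error threshold, and the gap_ratio of 2 are illustrative assumptions, not standard values.

```python
def diagnose(train_error, test_error, acceptable_error, gap_ratio=2.0):
    """Rough heuristic based on the comparison table.

    acceptable_error and gap_ratio are illustrative assumptions and
    should be chosen for the problem at hand.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on
        return "high bias (underfitting)"
    if test_error > acceptable_error and test_error > gap_ratio * train_error:
        # Fits the training data well but fails on unseen data
        return "high variance (overfitting)"
    return "balanced"

# Illustrative error values, not measurements
print(diagnose(train_error=291.7, test_error=300.0, acceptable_error=5.0))
print(diagnose(train_error=0.0, test_error=28.0, acceptable_error=5.0))
print(diagnose(train_error=0.4, test_error=0.5, acceptable_error=5.0))
```

In practice the two problems can overlap, so a rule like this is a starting point for investigation rather than a definitive test.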

Bias-Variance Tradeoff

In machine learning, reducing bias often increases variance, while reducing variance may increase bias.

This balance is called the:

Bias-Variance Tradeoff
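For squared-error loss the tradeoff can be written down exactly. The notation below is introduced here and is not from the article: y = f(x) + ε is the observed value, f is the true pattern, ε is noise with mean 0 and variance σ², and f̂ is the model learned from a randomly drawn training set. The expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first two terms depend on the model and usually move in opposite directions as complexity changes. The last term comes from the data itself and cannot be reduced by any model.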

Too Simple Model
       ↓
High Bias
       ↓
Underfitting

Too Complex Model
       ↓
High Variance
       ↓
Overfitting

Balanced Model
       ↓
Good Generalization

The goal is to achieve:

Low Bias
Low Variance

so that the model performs well on both training and unseen data.
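The tradeoff can be observed in a small simulation. The sketch below is an illustration under stated assumptions: the true rule is marks = 10 × hours, the noise is Gaussian with standard deviation 8, and the model is k-nearest-neighbour averaging. That model is not from the article. It is used only because a single number k controls its flexibility (a small k is flexible, a large k is rigid).

```python
import random
from statistics import mean, pvariance

HOURS = [0.5 * i for i in range(1, 13)]   # 0.5, 1.0, ..., 6.0
NOISE_SD = 8.0                            # assumed noise level

def true_marks(h):
    return 10 * h

def knn_predict(train, query, k):
    # Average the marks of the k training points closest to the query
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return mean(y for _, y in nearest)

def error_parts(k, rng, runs=500):
    # Collect predictions at every query point over many training sets
    preds = {q: [] for q in HOURS}
    for _ in range(runs):
        train = [(h, true_marks(h) + rng.gauss(0, NOISE_SD)) for h in HOURS]
        for q in HOURS:
            preds[q].append(knn_predict(train, q, k))
    bias_sq = mean((mean(p) - true_marks(q)) ** 2 for q, p in preds.items())
    variance = mean(pvariance(p) for p in preds.values())
    return bias_sq, variance

rng = random.Random(0)
results = {k: error_parts(k, rng) for k in (1, 3, 12)}
for k, (bias_sq, variance) in results.items():
    total = bias_sq + variance
    print(f"k={k:2d}  bias^2={bias_sq:7.1f}  variance={variance:6.1f}  sum={total:7.1f}")
```

With k = 1 the model copies single noisy points (low bias, high variance). With k = 12 it predicts the overall average everywhere (high bias, low variance). The middle setting gives the smallest sum of the two in this setup.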

Analogy

Imagine a teacher evaluating students.

High Bias Teacher

The teacher says:

Everyone gets 35 marks.

This is too simplistic and inaccurate.

High Variance Teacher

The teacher memorizes every student's previous score and cannot evaluate new students properly.

Good Teacher

The teacher understands the general relationship:

More study usually leads to better marks.

and can evaluate new students accurately.

This is how a good machine learning model behaves.

How to Reduce Bias

To reduce high bias:

Use a more powerful model
Add useful features
Increase model complexity
Improve feature engineering
Reduce excessive regularization
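The last point in this list can be illustrated on the study-hours data. The sketch below assumes a one-feature ridge regression in which a penalty shrinks the slope toward zero. The function names and the penalty values are hypothetical choices for illustration.

```python
from statistics import mean

hours = [1, 2, 3, 4, 5, 6]
marks = [10, 20, 30, 40, 50, 60]

def fit_ridge_line(xs, ys, penalty):
    # One-feature ridge regression: the penalty shrinks the slope toward
    # zero, while the intercept (via centering) is left unpenalized
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / (s_xx + penalty)
    return lambda x: y_bar + slope * (x - x_bar)

def training_mse(model):
    return mean((y - model(x)) ** 2 for x, y in zip(hours, marks))

# Strong, moderate, and no regularization
mse_by_penalty = {
    penalty: training_mse(fit_ridge_line(hours, marks, penalty))
    for penalty in (1000, 17.5, 0)
}
for penalty, mse in mse_by_penalty.items():
    print(f"penalty={penalty:>6}  training MSE={mse:.2f}")
```

A very large penalty flattens the line until it behaves almost like the constant "35 marks" model, which is the high-bias case. Lowering the penalty lets the slope grow back toward 10 and the training error falls.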

How to Reduce Variance

To reduce high variance:

Use more training data
Apply regularization
Use cross-validation to detect overfitting and choose model complexity
Reduce model complexity
Use ensemble methods such as Random Forest
Prune decision trees
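The effect of the first item, more training data, can be checked by simulation. The sketch below assumes marks = 10 × hours plus Gaussian noise (standard deviation 3) and a straight-line model. It compares 1 student per study-hour value with 10 students per value. The function names are hypothetical.

```python
import random
from statistics import mean, pvariance

def fit_line(xs, ys):
    # Ordinary least squares for a straight line
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return lambda x: y_bar + slope * (x - x_bar)

def prediction_variance(students_per_hour, rng, runs=1000, query=5.5, noise_sd=3.0):
    # Each study-hour value 1..6 appears students_per_hour times
    hours = [h for h in range(1, 7) for _ in range(students_per_hour)]
    preds = []
    for _ in range(runs):
        marks = [10 * h + rng.gauss(0, noise_sd) for h in hours]
        preds.append(fit_line(hours, marks)(query))
    # Spread of the prediction across different training sets
    return pvariance(preds)

rng = random.Random(0)
var_small = prediction_variance(1, rng)    # 6 training points
var_large = prediction_variance(10, rng)   # 60 training points

print(f"variance with  6 points: {var_small:.2f}")
print(f"variance with 60 points: {var_large:.2f}")
```

With ten times as much data the prediction varies much less from one training set to the next, because the noise in individual marks averages out.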

Important Points

Bias is the error caused by oversimplified assumptions.
High bias leads to underfitting.
Variance is the error caused by excessive sensitivity to training data.
High variance leads to overfitting.
A good model balances bias and variance.
Comparing training error with testing error helps identify bias and variance issues.
The bias-variance tradeoff is one of the most important concepts in machine learning.
The ultimate goal is good generalization on unseen data.
