Bias and Variance

When building a machine learning model, we want it to learn the underlying pattern in the data and make accurate predictions on unseen data. Apart from noise in the data itself, a model's prediction error comes from two main sources:

  • Bias

  • Variance

Understanding these concepts is essential because they help explain underfitting, overfitting, and the bias-variance tradeoff.

What is Bias?

Bias is the error caused when a model is too simple to capture the underlying pattern in the data.

A high-bias model makes strong assumptions and fails to learn the actual relationship between input and output variables.

In simple words: bias measures how far the model's predictions are, on average, from the true pattern.

High bias leads to underfitting.

What is Variance?

Variance is the error caused when a model becomes too sensitive to the training data.

A high-variance model learns not only the actual pattern but also noise and small fluctuations present in the training dataset.

In simple words: variance measures how much the model's predictions change when it is trained on different samples of the data.

High variance leads to overfitting.

Example Dataset

Consider the following dataset:

Study Hours    Marks
1              10
2              20
3              30
4              40
5              50
6              60

The actual relationship is:

Marks = 10 × Study Hours

The data follows a clear increasing trend.

Case 1: High Bias (Underfitting)

Suppose a model predicts:

Marks = 35

for every student.

Study Hours    Actual Marks    Predicted Marks
1              10              35
2              20              35
3              30              35
4              40              35
5              50              35
6              60              35

Graphically:

60 |                     *
50 |                 *
40 |             *
35 |---------------------- Model
30 |         *
20 |     *
10 | *
   +------------------------
     1   2   3   4   5   6

What happened?

The model completely ignores the increasing trend.

It assumes:

Everyone gets 35 marks

The model is too simple to learn the pattern.

Training Error: High
Testing Error: High

This situation is called high bias, or underfitting.
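To put a number on "high", the short sketch below computes the mean squared error of the constant prediction on the dataset above. The two unseen students (7 and 8 study hours) are hypothetical and assume that Marks = 10 × Study Hours continues to hold.

```python
import numpy as np

# Dataset from the example above.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([10, 20, 30, 40, 50, 60], dtype=float)

# High-bias model: ignore the input and always predict 35.
predicted = np.full_like(marks, 35.0)

# Mean squared error on the training data.
train_mse = np.mean((marks - predicted) ** 2)

# Hypothetical unseen students (an assumption, not part of the dataset),
# generated from the same rule: Marks = 10 x Study Hours.
new_hours = np.array([7.0, 8.0])
new_marks = 10.0 * new_hours
test_mse = np.mean((new_marks - 35.0) ** 2)

print(f"Training MSE: {train_mse:.2f}")
print(f"Testing MSE:  {test_mse:.2f}")
```

Note that 35 is the mean of the six marks, so no other constant prediction would give a lower training error. The remaining error comes from the form of the model, not from a poor choice of constant.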

Case 2: Balanced Model (Good Generalization)

Now suppose the model learns:

Marks = 10 × Study Hours

Predictions:

Study Hours    Actual Marks    Predicted Marks
1              10              10
2              20              20
3              30              30
4              40              40
5              50              50
6              60              60

Graph:

60 |                     *
50 |                 *
40 |             *
30 |         *
20 |     *
10 | *
   +------------------------
     1   2   3   4   5   6

The model correctly learns the relationship between study hours and marks.

Training Error: Low
Testing Error: Low

This is the ideal situation.
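One possible way to obtain such a model is an ordinary least-squares line fit. The sketch below uses NumPy's `np.polyfit` with degree 1; the choice of library and fitting method is an assumption for illustration, since the example above does not say how the model was trained.

```python
import numpy as np

# Dataset from the example above.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([10, 20, 30, 40, 50, 60], dtype=float)

# Fit Marks = slope * Study Hours + intercept by least squares.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(hours, marks, deg=1)

predicted = slope * hours + intercept
train_mse = np.mean((marks - predicted) ** 2)

print(f"slope = {slope:.4f}")
print(f"intercept = {intercept:.4f}")
print(f"training MSE = {train_mse:.2e}")
```

Because the data lie exactly on a line, the fitted slope is 10 and the intercept is 0 up to floating-point rounding, and the training error is effectively zero.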

Case 3: High Variance (Overfitting)

Now suppose the training data contains some noise.

Study Hours    Marks
1              10
2              18
3              33
4              38
5              55
6              59

An overfitted model tries to memorize every training point exactly.

Graph:

60 |                     *
55 |                 *
50 |
45 |
40 |             *
35 |         *
30 |
25 |
20 |     *
15 |
10 | *
   +------------------------
     1   2   3   4   5   6

Instead of learning:

More Study Hours → Higher Marks

it learns:

1 hour → 10
2 hours → 18
3 hours → 33
4 hours → 38
5 hours → 55
6 hours → 59

exactly.

Training Error: Very Low (Almost Zero)
Testing Error: High, because new students will not follow these exact values.

This situation is called high variance, or overfitting.
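As a rough illustration of memorization, the sketch below fits a degree-5 polynomial to the six noisy points. A degree-5 polynomial has six coefficients, so it can pass through every training point. The polynomial is only a stand-in for "a model that memorizes"; the unseen students at half-hour marks are hypothetical and assume the true relationship is still Marks = 10 × Study Hours.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Noisy training data from the table above.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([10, 18, 33, 38, 55, 59], dtype=float)

# Degree 5: six coefficients, enough to pass through all six points.
overfit_model = Polynomial.fit(hours, marks, deg=5)
# Degree 1: a straight line, for comparison.
simple_model = Polynomial.fit(hours, marks, deg=1)


def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((y - model(x)) ** 2))


# Hypothetical unseen students (an assumption for illustration),
# generated from the noise-free rule Marks = 10 x Study Hours.
new_hours = np.array([1.5, 2.5, 3.5, 4.5, 5.5])
new_marks = 10.0 * new_hours

overfit_train_mse = mse(overfit_model, hours, marks)
overfit_test_mse = mse(overfit_model, new_hours, new_marks)
simple_train_mse = mse(simple_model, hours, marks)
simple_test_mse = mse(simple_model, new_hours, new_marks)

print(f"degree 5: train MSE = {overfit_train_mse:.2e}, test MSE = {overfit_test_mse:.2f}")
print(f"degree 1: train MSE = {simple_train_mse:.2f}, test MSE = {simple_test_mse:.2f}")
```

The degree-5 fit wins on the training data but swings between the training points, so it does worse than the straight line on the new students.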

High Bias vs High Variance

Aspect              High Bias            High Variance
Model Complexity    Too Simple           Too Complex
Learning Ability    Learns too little    Learns too much
Training Error      High                 Very Low
Testing Error       High                 High
Main Problem        Underfitting         Overfitting
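The table can be turned into a rough diagnostic rule. The function below is only a heuristic sketch: the name `diagnose` and the two thresholds `acceptable_error` and `max_gap` are hypothetical, and sensible values depend entirely on the problem and the error metric.

```python
def diagnose(train_error, test_error, acceptable_error, max_gap):
    """Rough heuristic based on the comparison table.

    acceptable_error: largest training error considered "low" (hypothetical threshold).
    max_gap: largest tolerated difference between testing and training error
             (hypothetical threshold).
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (underfitting)"
    if test_error - train_error > max_gap:
        # Fits the training data well but does not carry over to new data.
        return "high variance (overfitting)"
    return "balanced"


# Illustrative error values, not measurements.
print(diagnose(290.0, 300.0, acceptable_error=5.0, max_gap=5.0))  # high bias (underfitting)
print(diagnose(0.0, 27.0, acceptable_error=5.0, max_gap=5.0))     # high variance (overfitting)
print(diagnose(0.5, 0.6, acceptable_error=5.0, max_gap=5.0))      # balanced
```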

Bias-Variance Tradeoff

In machine learning, reducing bias often increases variance, while reducing variance may increase bias.

This balance is called the bias-variance tradeoff.
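For squared-error loss, the tradeoff is usually stated through the standard decomposition of expected prediction error. The notation here is an assumption introduced for illustration: the data follow y = f(x) + ε with noise variance σ², f̂ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}\left[\hat{f}(x)\right] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```

The first two terms depend on the model, while the noise term cannot be removed by any choice of model.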

Too Simple Model → High Bias → Underfitting

Too Complex Model → High Variance → Overfitting

Balanced Model → Good Generalization

The goal is to keep both bias and variance low, so that the model performs well on both training and unseen data.
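Bias and variance can also be estimated by simulation. The sketch below repeatedly draws noisy training sets from Marks = 10 × Study Hours, fits three models of increasing complexity, and looks at their predictions for a student who studies 5.5 hours. The noise level, the query point, the number of trials, and the use of polynomial fits are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

hours = np.arange(1, 7, dtype=float)
true_marks = 10.0 * hours

noise_std = 3.0   # assumed noise level in the marks
x0 = 5.5          # query point; the true value there is 55
true_value = 10.0 * x0
n_trials = 2000   # number of simulated training sets

const_preds = np.empty(n_trials)
line_preds = np.empty(n_trials)
poly_preds = np.empty(n_trials)

for t in range(n_trials):
    # A fresh noisy training set for each trial.
    noisy_marks = true_marks + rng.normal(0.0, noise_std, size=hours.size)
    const_preds[t] = noisy_marks.mean()                            # too simple
    line_preds[t] = Polynomial.fit(hours, noisy_marks, deg=1)(x0)  # balanced
    poly_preds[t] = Polynomial.fit(hours, noisy_marks, deg=5)(x0)  # too complex


def bias_sq_and_variance(preds, truth):
    """Squared bias and variance of the predictions across training sets."""
    return float((preds.mean() - truth) ** 2), float(preds.var())


const_bias2, const_var = bias_sq_and_variance(const_preds, true_value)
line_bias2, line_var = bias_sq_and_variance(line_preds, true_value)
poly_bias2, poly_var = bias_sq_and_variance(poly_preds, true_value)

for name, b2, var in [("constant", const_bias2, const_var),
                      ("degree 1", line_bias2, line_var),
                      ("degree 5", poly_bias2, poly_var)]:
    print(f"{name:9s} bias^2 = {b2:8.2f}   variance = {var:6.2f}")
```

Under these assumptions the constant model shows a large squared bias and a small variance, the degree-5 polynomial shows almost no bias but the largest variance, and the straight line keeps both small.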

Analogy

Imagine a teacher evaluating students.

High Bias Teacher

The teacher says:

Everyone gets 35 marks.

This is too simplistic and inaccurate.

High Variance Teacher

The teacher memorizes every student's previous score and cannot evaluate new students properly.

Good Teacher

The teacher understands the general relationship:

More study usually leads to better marks.

and can evaluate new students accurately.

This is how a good machine learning model behaves.

How to Reduce Bias

To reduce high bias:

  • Use a more powerful model

  • Add useful features

  • Increase model complexity

  • Improve feature engineering

  • Reduce excessive regularization

How to Reduce Variance

To reduce high variance:

  • Use more training data

  • Apply regularization

  • Use cross-validation to detect overfitting and to choose model complexity and regularization strength

  • Reduce model complexity

  • Use ensemble methods such as Random Forest

  • Prune decision trees
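As one example from the list above, the sketch below applies ridge (L2) regularization to a degree-5 polynomial fitted on the noisy study-hours data from Case 3. It uses the closed-form ridge solution written with plain NumPy; the `ridge_fit` helper and the penalty values are hypothetical, and for simplicity the intercept is penalized too, which a production implementation would usually avoid.

```python
import numpy as np

# Noisy training data from Case 3.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
marks = np.array([10, 18, 33, 38, 55, 59], dtype=float)

# Scale hours to [-1, 1] so the polynomial features have comparable size.
scaled = (hours - 3.5) / 2.5
# Columns: 1, x, x^2, ..., x^5
X = np.vander(scaled, N=6, increasing=True)


def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


lams = [0.0, 0.1, 1.0, 10.0]  # illustrative penalty strengths
coef_norms = []
train_mses = []

for lam in lams:
    w = ridge_fit(X, marks, lam)
    coef_norms.append(float(np.linalg.norm(w)))
    train_mses.append(float(np.mean((marks - X @ w) ** 2)))
    print(f"lambda = {lam:5.1f}   coefficient norm = {coef_norms[-1]:8.2f}   "
          f"training MSE = {train_mses[-1]:.3e}")
```

As the penalty grows, the coefficients shrink and the training error rises: the model gives up some of its ability to memorize the training points (more bias) in exchange for being less sensitive to them (less variance).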

Important Points

  • Bias is the error caused by oversimplified assumptions.

  • High bias leads to underfitting.

  • Variance is the error caused by excessive sensitivity to training data.

  • High variance leads to overfitting.

  • A good model balances bias and variance.

  • Training error and testing error help identify bias and variance issues.

  • The bias-variance tradeoff is one of the most important concepts in machine learning.

  • The ultimate goal is good generalization on unseen data.

