Understanding fit(), transform(), and fit_transform() in Feature Scaling

Feature Scaling is an important preprocessing step in Machine Learning. Some algorithms such as:

SVR
KNN
SVM
Logistic Regression

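are sensitive to the scale of the data, because they rely on distances between points or on regularized, gradient-based optimization. A feature with a large numeric range can dominate the others.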
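To make the scale problem concrete, here is a small sketch using two made-up people described by age (in years) and income (in dollars). The data and variable names are hypothetical and chosen only for illustration. It compares the Euclidean distance computed on both features with the distance computed on income alone.

```python
import numpy as np

# Hypothetical people: [age in years, income in dollars]
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Distance using both features
d_full = np.linalg.norm(a - b)

# Distance using income only (age ignored)
d_income_only = abs(a[1] - b[1])

print(d_full)
print(d_income_only)
```

The two numbers are almost identical, so the 35-year age gap has nearly no influence on the distance. A distance-based model such as KNN would effectively ignore age until the features are put on a comparable scale.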

To solve this problem, we can use scikit-learn's StandardScaler().

StandardScaler standardizes each feature (column) so that, on the data it was fitted on:

Mean = 0
Standard Deviation = 1

Example Dataset

Suppose we have the following data:

X
1
2
3
4
5

We want to scale these values.

StandardScaler Formula

The scaling formula is:

Scaled Value = (X - Mean) / Standard Deviation

Where:

X = Original value
Mean = Average of data
Standard Deviation = Spread of data
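The formula can be tried directly with NumPy before bringing in scikit-learn. This is a minimal sketch on the five example values; the variable names are illustrative only.

```python
import numpy as np

# The five example values from this post
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean = X.mean()  # average of the data
std = X.std()    # standard deviation (NumPy divides by n by default)

# Scaled Value = (X - Mean) / Standard Deviation, applied to every value
scaled = (X - mean) / std

print(scaled)
print(scaled.mean(), scaled.std())
```

After scaling, the values have a mean of (approximately) 0 and a standard deviation of 1, which is the property described above.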

Step 1: Understanding fit()

Suppose we use:

scaler.fit(X)

The fit() method:

studies the data
calculates mean
calculates standard deviation

It does NOT scale the data.

Mean Calculation

Dataset:

1, 2, 3, 4, 5

Formula:

Mean = Sum of values / Number of values

Calculation:

Mean = (1 + 2 + 3 + 4 + 5) / 5

Mean = 15 / 5

Mean = 3

Standard Deviation Calculation

Formula:

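Std = √(Σ(X - Mean)² / n)

where n is the number of values. This is the population standard deviation (dividing by n, not n - 1), which is what StandardScaler uses.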

X	X - Mean	(X - Mean)²
1	-2	4
2	-1	1
3	0	0
4	1	1
5	2	4

Sum:

4 + 1 + 0 + 1 + 4 = 10

Variance:

10 / 5 = 2

Standard deviation:

√2 ≈ 1.414
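The hand calculation can be reproduced step by step in Python. The sketch below also compares it with NumPy's and pandas' built-in standard deviation, because the two libraries use different defaults (NumPy divides by n, pandas divides by n - 1). Variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5])

# Same steps as the table above
mean = values.sum() / len(values)             # 15 / 5
squared_diffs = (values - mean) ** 2          # (X - Mean)^2 for each value
variance = squared_diffs.sum() / len(values)  # 10 / 5
std = np.sqrt(variance)                       # square root of the variance

# Built-in equivalents
population_std = np.std(values)       # divides by n (ddof=0 is the default)
sample_std = pd.Series(values).std()  # divides by n - 1 (ddof=1 is the default)

print(mean, variance, std)
print(population_std, sample_std)
```

Only the population version matches the value StandardScaler learns. If a manual check with pandas gives a slightly larger number, the n - 1 divisor is the likely reason.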

So after fit():

Mean = 3
Standard Deviation = 1.414

The scaler stores these values internally.
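The stored values can be inspected after calling fit(). The following sketch uses the fitted attributes mean_ and scale_ of StandardScaler and shows that the input data itself is left unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature (column), five samples (rows)
X = np.array([[1], [2], [3], [4], [5]])

scaler = StandardScaler()
result = scaler.fit(X)  # learns the statistics and returns the scaler itself

print(scaler.mean_)   # learned mean for each feature
print(scaler.scale_)  # learned standard deviation for each feature
print(X.ravel())      # the original data has not been scaled
```

Note that fit() returns the scaler object, not scaled data, which is why a separate transform() call is needed to get scaled values.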

Step 2: Understanding transform()

Now suppose we use:

scaler.transform(X)

The transform() method:

uses the mean and standard deviation learned during fit()
applies scaling formula to every value

Scaling Calculation

Formula:

Scaled Value = (X - Mean) / Std

For X = 1

(1 - 3) / 1.414

-2 / 1.414

-1.414

For X = 2

(2 - 3) / 1.414

-0.707

For X = 3

(3 - 3) / 1.414

0

For X = 4

(4 - 3) / 1.414

0.707

For X = 5

(5 - 3) / 1.414

1.414

Final Scaled Values

Original Value	Scaled Value
1	-1.414
2	-0.707
3	0
4	0.707
5	1.414
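The table can be checked against scikit-learn. In this sketch, transform() is compared with the formula applied by hand using the learned mean_ and scale_ attributes; the name manual is illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]])

# Learn mean and standard deviation first
scaler = StandardScaler().fit(X)

# transform() applies (X - mean) / std using the learned values
X_scaled = scaler.transform(X)

# The same formula applied by hand
manual = (X - scaler.mean_) / scaler.scale_

print(np.round(X_scaled.ravel(), 3))
print(np.allclose(X_scaled, manual))
```

Both approaches should agree, and the rounded values should match the table above.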

Step 3: Understanding fit_transform()

Suppose we use:

scaler.fit_transform(X)

This performs BOTH operations together:

fit()
- Learns mean and standard deviation
transform()
- Applies scaling formula

So:

fit_transform() = fit() + transform()
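This equivalence can be verified with a short sketch: one scaler uses fit_transform(), another uses fit() followed by transform(), and the results are compared. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]])

# One call that learns and scales
one_step = StandardScaler().fit_transform(X)

# Two separate calls on a fresh scaler
scaler = StandardScaler()
scaler.fit(X)
two_steps = scaler.transform(X)

same = np.allclose(one_step, two_steps)
print(same)
```

For StandardScaler the two paths give the same scaled values, so fit_transform() is mainly a convenience for the training data.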

Why fit_transform() is Used During Training

During training:

The scaler must first learn from data
Then scale the same data

So we use: fit_transform()

Why Only transform() is Used During Testing

Suppose we receive new testing data:

new_X = [[6]]

We use:

scaler.transform(new_X)

Calculation:

(6 - 3) / 1.414

2.121

Here:

Mean = 3
Std = 1.414

are reused from training data.
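A minimal sketch of this step: the scaler is fitted on the training values only, and the new value is then scaled with transform(). The names X_train and new_scaled are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [2], [3], [4], [5]])
new_X = np.array([[6]])

scaler = StandardScaler()
scaler.fit_transform(X_train)  # learns mean = 3, std = 1.414 from training data

# transform() reuses the training statistics; nothing is re-learned
new_scaled = scaler.transform(new_X)

print(new_scaled)
print(scaler.mean_, scaler.scale_)  # still the training values
```

The stored mean and standard deviation are the same before and after the transform() call, which is what keeps training and testing data on the same scale.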

Why NOT fit_transform() on Testing Data?

If we use:

scaler.fit_transform(new_X)

then:

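New mean becomes 6
Standard deviation becomes 0

The scaler forgets the training statistics and re-learns them from the test data alone. The value 6 is now scaled to 0 instead of 2.121 (scikit-learn replaces a zero standard deviation with 1 to avoid dividing by zero). The model then receives inputs on a different scale from the one it was trained on, so its predictions become unreliable.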
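The difference can be seen by scaling the same new value both ways. In this sketch a second, separate scaler (named wrong_scaler here purely for illustration) is fitted on the test value, so that the correctly fitted scaler stays intact for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [2], [3], [4], [5]])
new_X = np.array([[6]])

# Correct: statistics come from the training data
scaler = StandardScaler().fit(X_train)
correct = scaler.transform(new_X)

# Incorrect: statistics are re-learned from the single test value
wrong_scaler = StandardScaler()
wrong = wrong_scaler.fit_transform(new_X)

print(correct)             # scaled relative to the training data
print(wrong)               # scaled relative to itself
print(wrong_scaler.mean_)  # the mean is now the test value
```

With the wrong approach the value 6 looks like a perfectly average input, even though it lies well above everything seen during training.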

Machine Learning models require:

Same scaling during training and testing

So:

Training data uses fit_transform()
Testing/new data uses transform()

Python Program Example

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Original dataset
X = pd.DataFrame({
    "Values": [1, 2, 3, 4, 5]
})

print("Original Data:")
print(X)

# Create scaler object
scaler = StandardScaler()

# fit_transform() on training data
X_scaled = scaler.fit_transform(X)

print("\nScaled Training Data:")
print(X_scaled)

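# New testing data (same column name as the training data, so
# scikit-learn does not warn about missing feature names)
new_X = pd.DataFrame({
    "Values": [6]
})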

# transform() on testing data
new_X_scaled = scaler.transform(new_X)

print("\nScaled New Data:")
print(new_X_scaled)

Expected Output

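The program prints the following (scaled values are shown rounded to three decimal places; the actual output has more digits):

Original Data:
   Values
0       1
1       2
2       3
3       4
4       5

Scaled Training Data:
[[-1.414]
 [-0.707]
 [ 0.   ]
 [ 0.707]
 [ 1.414]]

Scaled New Data:
[[2.121]]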

Final Understanding

Method	Purpose
fit()	Learns mean and standard deviation
transform()	Applies scaling
fit_transform()	Learns and applies scaling together

Training data uses:

fit_transform()

Testing/new data uses:

transform()