fit_transform and transform

Understanding fit(), transform(), and fit_transform() in Feature Scaling

Feature Scaling is an important preprocessing step in Machine Learning. Some algorithms such as:

  • SVR

  • KNN

  • SVM

  • Logistic Regression

are sensitive to the scale of the data, because they rely on distances between points or on regularized, gradient-based optimization.

To solve this problem, we use: StandardScaler()

StandardScaler rescales each feature (column) so that, on the data it was fitted on:

  • Mean = 0

  • Standard Deviation = 1

Example Dataset

Suppose we have the following data:

X

1

2

3

4

5

We want to scale these values.

StandardScaler Formula

The scaling formula is:

Scaled Value = (X - Mean) / Standard Deviation

Where:

  • X = Original value

  • Mean = Average of data

  • Standard Deviation = Spread of data
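As a minimal sketch, the formula above can be written as a small Python function. The name standardize is made up for this illustration and is not part of scikit-learn. It assumes the standard deviation is computed by dividing by the number of values, and that the values are not all identical (otherwise the division would fail).

```python
import math

def standardize(values):
    """Scale values with (x - mean) / std, using the std that divides by n."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(variance)  # assumed non-zero for this sketch
    return [(x - mean) / std for x in values]

scaled = standardize([1, 2, 3, 4, 5])
print([round(v, 3) for v in scaled])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

The scaled values are centred on 0, which is what "Mean = 0" refers to.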

Step 1: Understanding fit()

Suppose we use:

scaler.fit(X)

The fit() method:

  • studies the data

  • calculates mean

  • calculates standard deviation

It does NOT scale the data.

Mean Calculation

Dataset:

1, 2, 3, 4, 5

Formula:

Mean = Sum of values / Number of values

Calculation:

Mean = (1 + 2 + 3 + 4 + 5) / 5

Mean = 15 / 5

Mean = 3

Standard Deviation Calculation

Formula:

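Std = √(Σ(X - Mean)² / n)

where n is the number of values. StandardScaler divides by n (the population standard deviation), not by n - 1.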

X    X - Mean    (X - Mean)²
1       -2            4
2       -1            1
3        0            0
4        1            1
5        2            4

Sum:

4 + 1 + 0 + 1 + 4 = 10

Variance:

10 / 5 = 2

Standard deviation:

√2 ≈ 1.414
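The hand calculation can be checked with Python's built-in statistics module. This is only a sketch for verification. It also shows the sample standard deviation (dividing by n - 1) for comparison, which gives a different number and is not the one used in this walkthrough.

```python
import statistics

data = [1, 2, 3, 4, 5]

mean = statistics.mean(data)               # sum of values / number of values
pop_variance = statistics.pvariance(data)  # divides by n
pop_std = statistics.pstdev(data)          # square root of the variance above
sample_std = statistics.stdev(data)        # divides by n - 1 instead

print("Mean:", mean)
print("Variance:", pop_variance)
print("Std (divide by n):", round(pop_std, 3))
print("Std (divide by n - 1):", round(sample_std, 3))
```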

So after fit():

Mean = 3

Standard Deviation ≈ 1.414

The scaler stores these values internally.
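The stored values can be inspected after calling fit(). The sketch below uses a NumPy array with one column as an assumed stand-in for the dataset; mean_, var_ and scale_ are attributes that scikit-learn's StandardScaler exposes once it has been fitted.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature (column) with five samples (rows)
X = np.array([[1], [2], [3], [4], [5]], dtype=float)

scaler = StandardScaler()
scaler.fit(X)  # learns the statistics only; X itself is not modified

print(scaler.mean_)   # learned mean of each feature
print(scaler.var_)    # learned variance of each feature
print(scaler.scale_)  # learned standard deviation of each feature
print(X.ravel())      # still the original values
```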

Step 2: Understanding transform()

Now suppose we use:

scaler.transform(X)

The transform() method:

  • uses the mean and standard deviation learned during fit()

  • applies scaling formula to every value

Scaling Calculation

Formula:

Scaled Value = (X - Mean) / Std

For X = 1

(1 - 3) / 1.414 = -2 / 1.414 ≈ -1.414

For X = 2

(2 - 3) / 1.414 = -1 / 1.414 ≈ -0.707

For X = 3

(3 - 3) / 1.414 = 0 / 1.414 = 0

For X = 4

(4 - 3) / 1.414 = 1 / 1.414 ≈ 0.707

For X = 5

(5 - 3) / 1.414 = 2 / 1.414 ≈ 1.414

Final Scaled Values

Original Value    Scaled Value
1                 -1.414
2                 -0.707
3                  0
4                  0.707
5                  1.414
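To confirm that transform() performs the same arithmetic, the sketch below compares its result with the formula applied by hand using the learned statistics. The variable names (X_manual and so on) are chosen for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# Same calculation done by hand with the statistics learned in fit()
X_manual = (X - scaler.mean_) / scaler.scale_

print(np.round(X_scaled.ravel(), 3))
print(np.allclose(X_scaled, X_manual))  # True when both approaches agree
```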

Step 3: Understanding fit_transform()

Suppose we use:

scaler.fit_transform(X)

This performs BOTH operations together:

  1. fit()

    • Learns mean and standard deviation

  2. transform()

    • Applies scaling formula

So:

fit_transform() = fit() + transform()
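A short sketch of this equivalence, using two separate scaler objects on the same assumed dataset so that neither call influences the other:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)

# Two separate steps
two_step = StandardScaler()
two_step.fit(X)
X_two_step = two_step.transform(X)

# One combined call
one_step = StandardScaler()
X_one_step = one_step.fit_transform(X)

print(np.allclose(X_two_step, X_one_step))  # True when the results match
```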

Why fit_transform() is Used During Training

During training:

  • The scaler must first learn from data

  • Then scale the same data

So we use: fit_transform()

Why Only transform() is Used During Testing

Suppose new testing data:

new_X = [[6]]

We use:

scaler.transform(new_X)

Calculation:

(6 - 3) / 1.414 = 3 / 1.414 ≈ 2.121

Here:

  • Mean = 3

  • Std = 1.414

are reused from training data.
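The reuse of training statistics can be sketched as follows. X_train is an assumed name for the five training values; the point is that transform() leaves the learned mean and standard deviation untouched.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [2], [3], [4], [5]], dtype=float)
new_X = np.array([[6.0]])

scaler = StandardScaler()
scaler.fit_transform(X_train)  # learn from training data (result unused here)

new_X_scaled = scaler.transform(new_X)  # reuses the training mean and std

print(np.round(new_X_scaled, 3))
print(scaler.mean_, scaler.scale_)  # unchanged by the transform() call
```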

Why NOT fit_transform() on Testing Data?

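If we use:

scaler.fit_transform(new_X)

then the scaler is refitted on the new data alone:

  • New mean becomes 6

  • Standard deviation becomes 0

With a standard deviation of 0, scikit-learn divides by 1 instead to avoid division by zero, so the scaled value becomes (6 - 6) / 1 = 0 instead of 2.121. The statistics learned from the training data are overwritten, the scaling changes completely, and the model receives inputs that are inconsistent with what it was trained on.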
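The difference can be seen side by side in a small sketch. The names correct, incorrect and refit_scaler are made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [2], [3], [4], [5]], dtype=float)
new_X = np.array([[6.0]])

# Correct: reuse the statistics learned from the training data
scaler = StandardScaler()
scaler.fit(X_train)
correct = scaler.transform(new_X)

# Incorrect: a fresh fit on the single new value
refit_scaler = StandardScaler()
incorrect = refit_scaler.fit_transform(new_X)

print(np.round(correct, 3))  # scaled with the training mean and std
print(incorrect)             # scaled with the new value's own mean
print(refit_scaler.mean_)    # the mean learned from new_X alone
```

The same input value ends up with two different scaled values, which is why the model would see inconsistent inputs.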

Machine Learning models require:

Same scaling during training and testing

So:

  • Training data uses fit_transform()

  • Testing/new data uses transform()

Python Program Example

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Original dataset
X = pd.DataFrame({
    "Values": [1, 2, 3, 4, 5]
})

print("Original Data:")
print(X)

# Create scaler object
scaler = StandardScaler()

# fit_transform() on training data
X_scaled = scaler.fit_transform(X)

print("\nScaled Training Data:")
print(X_scaled)

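# New testing data, with the same column name as the training data
# (a plain list such as [[6]] also works, but triggers a feature-name warning)
new_X = pd.DataFrame({
    "Values": [6]
})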

# transform() on testing data
new_X_scaled = scaler.transform(new_X)

print("\nScaled New Data:")
print(new_X_scaled)


Expected Output

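Original Data:
   Values
0       1
1       2
2       3
3       4
4       5

Scaled Training Data:
[[-1.41421356]
 [-0.70710678]
 [ 0.        ]
 [ 0.70710678]
 [ 1.41421356]]

Scaled New Data:
[[2.12132034]]

Rounded to three decimal places, these are the same values calculated by hand above: -1.414, -0.707, 0, 0.707, 1.414 and 2.121.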

Final Understanding

Method             Purpose
fit()              Learns mean and standard deviation
transform()        Applies scaling
fit_transform()    Learns and applies scaling together

Training data uses:

fit_transform()

Testing/new data uses:

transform()
