fit_transform and transform
Understanding fit(), transform(), and fit_transform() in Feature Scaling
Feature Scaling is an important preprocessing step in Machine Learning. Some algorithms are sensitive to the scale of the data, such as:
- SVR
- KNN
- SVM
- Logistic Regression
To solve this problem, we use: StandardScaler()
StandardScaler rescales each feature (column) so that, on the data it was fitted on:
- Mean = 0
- Standard Deviation = 1
Example Dataset
Suppose we have the following data:
| X |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
We want to scale these values.
StandardScaler Formula
The scaling formula is:
Scaled Value = (X - Mean) / Standard Deviation
Where:
- X = Original value
- Mean = Average of the data
- Standard Deviation = Spread of the data
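The formula can be sketched in a few lines of NumPy. The helper name `standardize` is hypothetical and only illustrates the arithmetic; it is not how StandardScaler is implemented internally.

```python
import numpy as np

def standardize(values, mean, std):
    """Apply (X - Mean) / Standard Deviation to every value."""
    values = np.asarray(values, dtype=float)
    return (values - mean) / std

X = np.array([1, 2, 3, 4, 5], dtype=float)

# np.std divides by n by default, which matches the formula used in this article
scaled = standardize(X, X.mean(), X.std())

print(scaled.round(3))
print(scaled.mean(), scaled.std())  # mean close to 0, standard deviation close to 1
```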
Step 1: Understanding fit()
Suppose we use:
scaler.fit(X)
The fit() method:
- studies the data
- calculates the mean
- calculates the standard deviation
It does NOT scale the data.
Mean Calculation
Dataset:
1, 2, 3, 4, 5
Formula:
Mean = Sum of values / Number of values
Calculation:
Mean = (1 + 2 + 3 + 4 + 5) / 5
Mean = 15 / 5
Mean = 3
Standard Deviation Calculation
Formula:
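Std = √(Σ(X - Mean)² / n)
Here n is the number of values. StandardScaler divides by n (the population standard deviation), not by n - 1.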
| X | X - Mean | (X - Mean)² |
|---|---|---|
| 1 | -2 | 4 |
| 2 | -1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
Sum:
4 + 1 + 0 + 1 + 4 = 10
Variance:
10 / 5 = 2
Standard deviation:
√2 ≈ 1.414
So after fit():
Mean = 3
Standard Deviation ≈ 1.414
The scaler stores these values internally.
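The stored values can be inspected after fitting. The sketch below assumes scikit-learn and NumPy are installed and reads the fitted attributes `mean_`, `var_`, and `scale_`; the NumPy cross-check is added only for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)  # one feature, five samples

scaler = StandardScaler()
scaler.fit(X)  # learns the statistics only; X itself is left unchanged

print(scaler.mean_)   # learned mean per feature
print(scaler.var_)    # learned variance per feature
print(scaler.scale_)  # learned standard deviation per feature

# Cross-check with NumPy: ddof=0 divides by n, ddof=1 divides by n - 1
print(np.std(X, ddof=0))  # about 1.414, matches scaler.scale_
print(np.std(X, ddof=1))  # about 1.581, the sample standard deviation
```

If a hand calculation gives 1.581 instead of 1.414, the difference is usually the n - 1 divisor (for example, the default of pandas `.std()`).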
Step 2: Understanding transform()
Now suppose we use:
scaler.transform(X)
The transform() method:
- uses the mean and standard deviation learned during fit()
- applies the scaling formula to every value
Scaling Calculation
Formula:
Scaled Value = (X - Mean) / Std
For X = 1
(1 - 3) / 1.414 = -2 / 1.414 ≈ -1.414
For X = 2
(2 - 3) / 1.414 = -1 / 1.414 ≈ -0.707
For X = 3
(3 - 3) / 1.414 = 0 / 1.414 = 0
For X = 4
(4 - 3) / 1.414 = 1 / 1.414 ≈ 0.707
For X = 5
(5 - 3) / 1.414 = 2 / 1.414 ≈ 1.414
Final Scaled Values
| Original Value | Scaled Value |
|---|---|
| 1 | -1.414 |
| 2 | -0.707 |
| 3 | 0 |
| 4 | 0.707 |
| 5 | 1.414 |
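As a quick check of the table, the result of transform() can be compared with the formula written out by hand. This is a minimal sketch that assumes the same five-value dataset stored as a NumPy column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)

scaler = StandardScaler().fit(X)

# transform() applies the stored statistics to every value
from_scaler = scaler.transform(X)

# The same formula written out by hand with the stored attributes
by_hand = (X - scaler.mean_) / scaler.scale_

print(from_scaler.ravel().round(3))
print(np.allclose(from_scaler, by_hand))
```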
Step 3: Understanding fit_transform()
Suppose we use:
scaler.fit_transform(X)
This performs BOTH operations together:
- fit()
  - Learns mean and standard deviation
- transform()
  - Applies scaling formula
So:
fit_transform() = fit() + transform()
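The equivalence can be verified with two separate scaler objects, one using the single call and one using the two calls. A minimal sketch on the same dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1], [2], [3], [4], [5]], dtype=float)

# One call
one_step = StandardScaler().fit_transform(X)

# Two calls on a separate scaler
scaler = StandardScaler()
scaler.fit(X)
two_steps = scaler.transform(X)

print(np.allclose(one_step, two_steps))
```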
Why fit_transform() is Used During Training
During training:
- The scaler must first learn from the data
- Then scale the same data
So we use: fit_transform()
Why Only transform() is Used During Testing
Suppose new testing data:
new_X = [[6]]
We use:
scaler.transform(new_X)
Calculation:
(6 - 3) / 1.414
2.121
Here, the values learned from the training data are reused:
- Mean = 3
- Std = 1.414
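The same calculation in code, as a sketch: the variable name `X_train` is introduced here for the original five values, and the attributes are printed again to show that transform() does not change them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [2], [3], [4], [5]], dtype=float)
new_X = np.array([[6]], dtype=float)

scaler = StandardScaler()
scaler.fit_transform(X_train)  # statistics come from the training data only

new_scaled = scaler.transform(new_X)  # reuses mean 3 and std of about 1.414

print(new_scaled.round(3))
print(scaler.mean_, scaler.scale_)  # unchanged by transform()
```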
Why NOT fit_transform() on Testing Data?
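If we use:
scaler.fit_transform(new_X)
then the scaler is refitted on the new data alone:
- New mean becomes 6
- Standard deviation becomes 0
The mean and standard deviation learned from the training data are overwritten. scikit-learn replaces a zero standard deviation with 1 to avoid dividing by zero, so the value 6 would be scaled to 0 instead of 2.121. The model would then receive inputs on a different scale from the one it was trained on, which makes its predictions unreliable.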
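The two approaches can be compared side by side. In this sketch the names `train_scaler` and `refit_scaler` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1], [2], [3], [4], [5]], dtype=float)
new_X = np.array([[6]], dtype=float)

# Correct: fit on the training data, then only transform the new sample
train_scaler = StandardScaler().fit(X_train)
correct = train_scaler.transform(new_X)

# Mistake: refit on the new sample alone
refit_scaler = StandardScaler()
wrong = refit_scaler.fit_transform(new_X)

print(correct.round(3))     # scaled relative to the training data
print(wrong)                # the single value collapses to 0
print(refit_scaler.mean_)   # mean of the new sample only
print(refit_scaler.scale_)  # scale actually used for the division
```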
Machine Learning models require:
Same scaling during training and testing
So:
- Training data uses fit_transform()
- Testing/new data uses transform()
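In a typical workflow the rule looks like the sketch below. The dataset is hypothetical (random numbers generated only for illustration), and `train_test_split` is used here as one common way to obtain the two sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 20 samples, 2 features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 10, 20), rng.normal(0.5, 0.1, 20)])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn from training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(3))  # close to 0 for each feature
print(X_test_scaled.mean(axis=0).round(3))   # generally not exactly 0
```

The test set mean is usually not exactly 0 after scaling, which is expected: the test data is measured against the training statistics, not its own.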
Python Program Example
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Original dataset
X = pd.DataFrame({
    "Values": [1, 2, 3, 4, 5]
})
print("Original Data:")
print(X)
# Create scaler object
scaler = StandardScaler()
# fit_transform() on training data
X_scaled = scaler.fit_transform(X)
print("\nScaled Training Data:")
print(X_scaled)
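# New testing data, given the same column name as the training data
# so that scikit-learn does not warn about missing feature names
new_X = pd.DataFrame({"Values": [6]})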
# transform() on testing data
new_X_scaled = scaler.transform(new_X)
print("\nScaled New Data:")
print(new_X_scaled)
Expected Output
(scaled values are shown rounded to three decimal places; the actual output prints more digits)
Original Data:
   Values
0       1
1       2
2       3
3       4
4       5

Scaled Training Data:
[[-1.414]
 [-0.707]
 [ 0.   ]
 [ 0.707]
 [ 1.414]]

Scaled New Data:
[[2.121]]
Final Understanding
| Method | Purpose |
|---|---|
| fit() | Learns mean and standard deviation |
| transform() | Applies scaling |
| fit_transform() | Learns and applies scaling together |
Training data uses:
fit_transform()
Testing/new data uses:
transform()