Data Preprocessing
Data Preprocessing is the process of converting raw data into a clean and understandable format before feeding it into a machine learning model. Since machine learning algorithms cannot directly work with raw and unorganized data, preprocessing helps transform the data into a suitable format for better learning and prediction.
Data Preprocessing is one of the most important steps in Machine Learning because the quality and format of the data directly affect model performance.
Why Data Preprocessing is Important
Data Preprocessing helps:
- Improve model accuracy
- Convert data into machine-readable format
- Handle categorical and numerical data
- Scale features properly
- Reduce training time
Machine learning algorithms perform better when data is properly preprocessed.
Common Steps in Data Preprocessing
1. Handling Missing Values
2. Encoding Categorical Data
3. Feature Scaling
4. Splitting Dataset
5. Handling Imbalanced Data
6. Data Transformation
1. Handling Missing Values
Datasets often contain missing or empty values.
Example:
| Age |
|---|
| 25 |
| NULL |
| 30 |
Common Methods:
- Mean Imputation
- Median Imputation
- Mode Imputation
- Removing rows
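The methods above can be sketched with pandas; the small Age column below is hypothetical, mirroring the table above:

```python
import pandas as pd

# Hypothetical column with a missing value, mirroring the table above
df = pd.DataFrame({"Age": [25, None, 30]})

# Mean imputation: replace the missing value with the column average
df["Age_mean"] = df["Age"].fillna(df["Age"].mean())

# Median imputation: replace it with the middle value instead
df["Age_median"] = df["Age"].fillna(df["Age"].median())

# Removing rows: drop any row where Age is missing
df_dropped = df.dropna(subset=["Age"])

print(df["Age_mean"].tolist())  # mean of 25 and 30 is 27.5
```

Mean imputation is sensitive to outliers, so median imputation is often preferred for skewed columns.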
2. Encoding Categorical Data
Machine learning models work with numbers, not text values. Categorical data must be converted into numerical form.
Example:
| Gender |
|---|
| Male |
| Female |
Converted as:
| Gender |
|---|
| 0 |
| 1 |
Types of Encoding
Label Encoding
Converts categories into numbers.
Example:
Male → 0
Female → 1
One-Hot Encoding
Creates separate columns for each category.
Example:
| Red | Blue | Green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
One-Hot Encoding is preferred when categories do not have any order.
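Both encodings can be sketched in a few lines; the Gender and Color columns here are hypothetical. Note that scikit-learn's `LabelEncoder` assigns codes in alphabetical order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Color": ["Red", "Blue", "Green"],
})

# Label Encoding: LabelEncoder assigns codes alphabetically,
# so here Female -> 0 and Male -> 1
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# One-Hot Encoding: one binary column per category
one_hot = pd.get_dummies(df["Color"], dtype=int)

print(df["Gender"].tolist())
print(one_hot)
```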
3. Feature Scaling
Feature Scaling standardizes the range of numerical values so that all features contribute equally to the model.
Why Scaling is Needed?
Suppose:
| Feature | Value |
|---|---|
| Age | 25 |
| Salary | 500000 |
Salary values are much larger than Age values. Some algorithms may give more importance to Salary.
Types of Scaling
Standardization
Transforms data to have:
- Mean = 0
- Standard deviation = 1
Normalization
Scales values between 0 and 1.
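The two scalers can be compared side by side with scikit-learn; this is a minimal sketch, and the Age/Salary values are made up to echo the table above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical [Age, Salary] rows with very different value ranges
X = np.array([[25.0, 500000.0], [30.0, 300000.0], [35.0, 400000.0]])

# Standardization: each column gets mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Normalization (min-max): each column is rescaled to the [0, 1] range
normalized = MinMaxScaler().fit_transform(X)

print(standardized.mean(axis=0))  # approximately [0, 0]
print(normalized.min(axis=0), normalized.max(axis=0))
```

Standardization suits algorithms that assume roughly centered data (e.g. many linear models), while min-max normalization is common when a bounded range is required.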
4. Splitting the Dataset
The dataset is divided into:
- Training Data
- Testing Data
Training Data: Used to train the model.
Testing Data: Used to evaluate model performance on unseen data.
Common Split Ratio:
80% → Training Data
20% → Testing Data
Example:
```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)
```
5. Handling Imbalanced Data
Sometimes one class has significantly more samples than another.
Example:
| Class | Count |
|---|---|
| Not Fraud | 990 |
| Fraud | 10 |
This is called imbalanced data.
Problems with Imbalanced Data
The model may ignore minority classes and produce misleading accuracy.
Solutions:
- Oversampling
- Undersampling
- SMOTE Technique
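SMOTE itself lives in the separate imbalanced-learn package; as a simpler sketch, random oversampling can be done with scikit-learn's `resample`, using a hypothetical dataset mirroring the fraud table above:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced dataset matching the table above (990 vs 10)
df = pd.DataFrame({"label": ["Not Fraud"] * 990 + ["Fraud"] * 10})

majority = df[df["label"] == "Not Fraud"]
minority = df[df["label"] == "Fraud"]

# Random oversampling: duplicate minority rows until classes are balanced
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_up])

print(balanced["label"].value_counts())
```

Oversampling is applied to the training set only; resampling before the train/test split would leak duplicated rows into the test data.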
6. Data Transformation
Data Transformation converts data into suitable formats for machine learning.
Examples:
- Log Transformation
- Power Transformation
- Binning
- Scaling
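Two of these transformations can be sketched with NumPy; the sample values are hypothetical:

```python
import numpy as np

# Hypothetical right-skewed values (e.g. salaries)
values = np.array([1_000.0, 10_000.0, 100_000.0])

# Log transformation compresses large values; log1p handles zeros safely
logged = np.log1p(values)

# Binning: group continuous values into discrete buckets
# using hypothetical cut points at 5,000 and 50,000
bins = np.digitize(values, bins=[5_000, 50_000])

print(logged)
print(bins)  # bucket index per value
```

Log transformation is useful when a feature spans several orders of magnitude, as in the salary example above.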
Real-World Example
Student Performance Dataset
Suppose a dataset contains:
- Missing marks
- Gender as text
- Different score ranges
Step 1:
Fill missing marks using average values.
Step 2:
Convert Gender into numerical form.
Step 3:
Scale marks between 0 and 1.
Step 4:
Split dataset into training and testing sets.
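The four steps above can be sketched end to end; the student data here is hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical student data with the issues listed above
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Marks": [40.0, None, 90.0, 60.0],
})

# Step 1: fill missing marks with the column average
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())

# Step 2: convert Gender to numbers (Female -> 0, Male -> 1)
df["Gender"] = (df["Gender"] == "Male").astype(int)

# Step 3: scale marks between 0 and 1 (min-max)
df["Marks"] = (df["Marks"] - df["Marks"].min()) / (
    df["Marks"].max() - df["Marks"].min()
)

# Step 4: split into training and testing sets (75% / 25%)
X_train, X_test = train_test_split(df, test_size=0.25, random_state=42)

print(df)
```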
Python Example Using Pandas
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

data = {
    "Gender": ["Male", "Female", "Male"],
    "Age": [25, 30, 28],
    "Salary": [50000, 60000, 55000]
}
df = pd.DataFrame(data)

# Label Encoding (codes are alphabetical: Female -> 0, Male -> 1)
encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])

# Feature Scaling (each column gets mean 0, standard deviation 1)
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(
    df[["Age", "Salary"]]
)

print(df)
```
Output:
```
   Gender       Age    Salary
0       1 -1.297771 -1.224745
1       0  1.135550  1.224745
2       1  0.162221  0.000000
```
Benefits of Data Preprocessing
- Improves model accuracy
- Reduces bias
- Speeds up training
- Helps algorithms learn efficiently
- Produces reliable predictions
Important Points:
1. Machine learning algorithms cannot directly work with raw categorical data.
2. Feature Scaling is important for distance-based algorithms.
3. One-Hot Encoding is preferred for unordered categorical data.
4. Training data is used to train models, while testing data is used for evaluation.
5. Imbalanced datasets can lead to misleading model performance.
Summary
Data Preprocessing is the process of preparing raw data for machine learning by handling missing values, encoding categorical variables, scaling numerical data, splitting datasets, and transforming features. Proper preprocessing improves model performance, training efficiency, and prediction accuracy.
Keywords
Data Preprocessing, Data Preprocessing in Machine Learning, Feature Scaling, Encoding Categorical Data, Label Encoding, One Hot Encoding, Normalization, Standardization, Train Test Split, Data Transformation, Handling Missing Values, Imbalanced Data, Machine Learning Pipeline, Python Data Preprocessing, Scikit Learn Preprocessing