Data Cleaning

Data Cleaning is the process of identifying and correcting errors, inconsistencies, and unwanted data in a dataset before using it for machine learning. Real-world datasets are often incomplete and messy, so cleaning the data is one of the most important steps in the ML pipeline.

Poor-quality data can lead to poor model performance, inaccurate predictions, and unreliable results. That is why Data Cleaning is essential before training any machine learning model.

Why Data Cleaning is Important

Data Cleaning helps:

Improve data quality
Increase model accuracy
Remove incorrect information
Reduce noise in the dataset
Improve training efficiency

Common Problems in Data

Real-world datasets usually contain several issues, such as:

Missing values
Duplicate records
Inconsistent data
Incorrect formats
Outliers
Noisy data

1. Missing Values

Missing values occur when some data is unavailable or empty.

Name	Age
John	25
Alex	NULL
Sam	30

Here, Alex’s age value is missing.

Methods to Handle Missing Values

1. Remove Missing Rows

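Drop the rows that contain missing values. This is reasonable when only a few values are missing, because every dropped row also discards its valid values.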

2. Mean Imputation

Replace missing numerical values with the average value.

3. Median Imputation

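Replace missing numerical values with the median. The median is less sensitive to extreme values than the mean, so it is useful when outliers are present.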

4. Mode Imputation

Replace missing values with the most frequent value. This is typically used for categorical data.

Example:

In the table above, Alex's missing age would be replaced with the average of the known ages: (25 + 30) / 2 = 27.5.
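
The four methods above can be sketched in pandas as follows. This is a minimal illustration, not a recommended recipe: the DataFrame mirrors the Name/Age table, and the "City" column and its values are hypothetical, added only so that mode imputation has a categorical column to work on.

```python
import pandas as pd

# Data mirroring the Name/Age table; "City" is a hypothetical column
# added only to illustrate mode imputation.
df = pd.DataFrame({
    "Name": ["John", "Alex", "Sam"],
    "Age": [25, None, 30],
    "City": ["Pune", None, "Pune"],
})

# 1. Remove rows that contain any missing value.
dropped = df.dropna()

# 2. Mean imputation: fill missing ages with the average of the known ages.
age_mean = df["Age"].fillna(df["Age"].mean())

# 3. Median imputation: fill missing ages with the median of the known ages.
age_median = df["Age"].fillna(df["Age"].median())

# 4. Mode imputation: mode() returns a Series (there can be ties),
#    so take its first entry.
city_mode = df["City"].fillna(df["City"].mode()[0])
```

With only two known ages the mean and the median coincide; on larger, skewed columns they can differ noticeably.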

2. Duplicate Data

Duplicate records are repeated rows in the dataset.

ID	Name
101	John
101	John

The second row is an exact copy of the first.

Why Duplicates are Problematic

Duplicates can:

Bias the model
Increase training time
Reduce data quality

Solution:

Remove duplicates using a function such as the pandas DataFrame method:

drop_duplicates()
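
A short sketch of how drop_duplicates() might be used on the table above. The third row (102, Alex) is hypothetical and is included only so that something other than the duplicate remains.

```python
import pandas as pd

# The first two rows come from the table above; the third is hypothetical.
df = pd.DataFrame({"ID": [101, 101, 102], "Name": ["John", "John", "Alex"]})

# Count rows that are exact copies of an earlier row.
n_dupes = int(df.duplicated().sum())

# Drop exact duplicate rows, keeping the first occurrence of each.
deduped = df.drop_duplicates().reset_index(drop=True)

# Alternatively, treat rows as duplicates when a key column repeats.
deduped_by_id = df.drop_duplicates(subset="ID", keep="first")
```

Checking duplicated().sum() before dropping is a cheap way to see how many rows will be removed.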

3. Inconsistent Data

Sometimes the same information is written in different formats.

Example:

Male
male
M
MALE

All represent the same category but are inconsistent.

Solution

Convert all values into a standard format.

Example:

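Convert all values to lowercase, then map abbreviations such as "m" to the full label "male". Lowercasing alone is not enough, because "m" and "male" would still count as two categories.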
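
A minimal pandas sketch of this standardization, using the four values listed above. The mapping dictionary is an assumption about which abbreviations appear in the data; a real dataset would need its own mapping.

```python
import pandas as pd

gender = pd.Series(["Male", "male", "M", "MALE"])

# Trim stray whitespace (none in this sample, but common in practice)
# and convert everything to lowercase.
normalized = gender.str.strip().str.lower()

# Lowercasing leaves "m" distinct from "male", so map abbreviations
# to the full label. This mapping is an illustrative assumption.
normalized = normalized.replace({"m": "male"})
```

After both steps the column contains a single category, "male".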

4. Outliers

Outliers are unusually high or low values compared to the rest of the data.

Example:

20, 25, 30, 28, 500

Here, 500 is an outlier.

Why Outliers Matter

Outliers can:

Affect averages
Distort model learning
Reduce prediction accuracy

Methods to Handle Outliers

Techniques:

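Remove outliers
Use the IQR method to detect values far outside the interquartile range
Use the Z-score method to detect values many standard deviations from the mean
Apply transformations, such as a log transform, to reduce their influence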
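
The IQR and Z-score methods can be sketched on the example values above. The 1.5 × IQR fences and the Z-score cutoff of 3 are common conventions assumed here, not values given in this lesson.

```python
import pandas as pd

values = pd.Series([20, 25, 30, 28, 500])

# IQR method: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: flag values more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]
```

On this sample the IQR method flags 500, but the Z-score method flags nothing: with only five points the outlier itself inflates the mean and standard deviation, and the Z-score of 500 comes out at about 1.79. Z-scores tend to be more useful on larger samples.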

5. Incorrect Data Formats

Data may be stored in incorrect formats.

Example:

Date
"12-05-2026"

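Here the date is stored as text instead of a date type, so it cannot be sorted or compared as a date. The text is also ambiguous: it could mean 12 May or 5 December.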

Why This is a Problem

Incorrect formats can:

Cause processing errors
Affect analysis
Reduce model efficiency

Solution:

Convert each column to its proper data type, for example by parsing date strings into a date type.
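
A hedged sketch of such a conversion with pandas. It assumes the text uses day-month-year order, which the example does not state; the second row is a hypothetical bad value included to show how unparseable entries can be handled.

```python
import pandas as pd

# First value is from the example above; the second is a hypothetical bad entry.
df = pd.DataFrame({"Date": ["12-05-2026", "not a date"]})

# Assumption: day-month-year order. Stating the format explicitly avoids
# silent misreads; errors="coerce" turns unparseable text into NaT.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y", errors="coerce")
```

Entries that become NaT can then be treated like any other missing value.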

6. Noisy Data

Noisy data contains random errors or meaningless values.

Example:

Sensor errors producing unrealistic temperature values.

Solution

Noise can be reduced using:

Filtering techniques
Smoothing methods
Data validation
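
As a rough sketch, validation and smoothing might look like this for a temperature sensor. The readings, the plausible range of -40 to 60 °C, and the window size of 3 are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical readings in °C with two unrealistic spikes.
temps = pd.Series([21.5, 21.7, 150.0, 21.9, 22.1, 22.0, -80.0, 22.2, 22.3])

# Data validation: keep readings inside an assumed plausible range,
# turning everything else into a missing value.
valid = temps.where(temps.between(-40, 60))

# Fill the resulting gaps by linear interpolation between neighbours.
cleaned = valid.interpolate()

# Smoothing: a centered rolling median suppresses isolated spikes.
smoothed = temps.rolling(window=3, center=True, min_periods=1).median()
```

A rolling median handles isolated spikes well but can be skewed at the edges of the series or when several bad readings occur in a row, so validation and smoothing are often combined.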

Steps in Data Cleaning

1. Identify problems
2. Handle missing values
3. Remove duplicates
4. Fix inconsistencies
5. Handle outliers
6. Correct formats
7. Validate cleaned data
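
The first step, identifying problems, can be partly automated with a quick summary. The helper name summarize_issues and the sample data are hypothetical; this sketch only counts missing values and duplicates and reports column types.

```python
import pandas as pd

def summarize_issues(df: pd.DataFrame) -> dict:
    """Hypothetical helper: report basic data-quality counts for a DataFrame."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Hypothetical sample with one missing age and one repeated row.
df = pd.DataFrame({
    "Name": ["John", "Alex", "Sam", "Sam"],
    "Age": [25, None, 30, 30],
})

report = summarize_issues(df)
```

Running the same summary again after cleaning is one simple way to carry out the final validation step.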

Example — Cleaning House Price Data

Suppose we have:

Area	Price
1200	50L
NULL	60L
1500	500 Cr

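Problems:

Missing area value
Prices stored as text with mixed units (L for lakh, Cr for crore)
Outlier in price

Cleaning:

Fill missing area with average area
Convert prices to numbers in a single unit
Remove or handle outlier price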
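
A possible pandas sketch of this cleaning. It assumes "L" means lakh (100,000) and "Cr" means crore (10,000,000). The parse_price helper and the rule of flagging prices above ten times the median are hypothetical; with only three rows, statistical rules such as IQR are not reliable, so a simple threshold is used instead.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": [1200, None, 1500],
    "Price": ["50L", "60L", "500 Cr"],
})

# Assumption: "L" = lakh (1e5) and "Cr" = crore (1e7).
UNIT_MULTIPLIER = {"L": 1e5, "Cr": 1e7}

def parse_price(text: str) -> float:
    """Hypothetical helper: convert text such as '50L' into a plain number."""
    text = text.strip()
    for unit, multiplier in UNIT_MULTIPLIER.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * multiplier
    return float(text)

# Convert prices to numbers in a single unit.
df["Price"] = df["Price"].map(parse_price)

# Fill the missing area with the average of the known areas.
df["Area"] = df["Area"].fillna(df["Area"].mean())

# Flag prices far above the median (illustrative rule, not a standard one).
df["PriceOutlier"] = df["Price"] > 10 * df["Price"].median()
```

Flagging the outlier in a separate column, rather than deleting the row immediately, leaves room to decide later whether to remove, cap, or keep it.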

Python Example

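import pandas as pd

# Load the dataset ("data.csv" is a placeholder file name)
df = pd.read_csv("data.csv")

# Remove exact duplicate rows
df = df.drop_duplicates()

# Fill missing values in the "Age" column with the column mean
df["Age"] = df["Age"].fillna(df["Age"].mean())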

Benefits of Data Cleaning

Better data quality
Improved model accuracy
Faster model training
Reliable predictions
Better business decisions

