Data Cleaning

Data Cleaning is the process of identifying and correcting errors, inconsistencies, and unwanted data in a dataset before using it for machine learning. Real-world datasets are often incomplete and messy, so cleaning the data is one of the most important steps in the ML pipeline.

Poor-quality data can lead to poor model performance, inaccurate predictions, and unreliable results. That is why Data Cleaning is essential before training any machine learning model.

Why Data Cleaning is Important

Data Cleaning helps:

  • Improve data quality
  • Increase model accuracy
  • Remove incorrect information
  • Reduce noise in the dataset
  • Improve training efficiency

Common Problems in Data

Real-world datasets usually contain several issues, such as: 

  • Missing values
  • Duplicate records
  • Inconsistent data
  • Incorrect formats
  • Outliers
  • Noisy data

1. Missing Values

Missing values occur when some data is unavailable or empty.

Name  Age
John  25
Alex  NULL
Sam   30

Here, Alex’s age value is missing.

Methods to Handle Missing Values

1. Remove Missing Rows

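Drop every row that contains a missing value. Best used when only a few values are missing, because each dropped row also discards its valid values.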

2. Mean Imputation

Replace missing numerical values with the average value.

3. Median Imputation

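Replace missing numerical values with the median. Useful when outliers are present, because the median is less sensitive to extreme values than the mean.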

4. Mode Imputation

Replace missing values with the most frequent value. Used for categorical data.

Example:

With mean imputation, Alex's missing age is replaced with the average of the known ages: (25 + 30) / 2 = 27.5.
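The four methods above can be sketched with pandas as follows. This is a minimal illustration on the small table from this section, not a recommendation of one method over another. The City column and its values are hypothetical, added only so that mode imputation has a categorical column to work on.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alex", "Sam"],
    "Age": [25, None, 30],
    "City": ["Pune", None, "Pune"],  # hypothetical categorical column
})

# 1. Remove rows where Age is missing
dropped = df.dropna(subset=["Age"])

# 2. Mean imputation: (25 + 30) / 2 = 27.5
mean_filled = df["Age"].fillna(df["Age"].mean())

# 3. Median imputation: the median of 25 and 30 is also 27.5
median_filled = df["Age"].fillna(df["Age"].median())

# 4. Mode imputation: fill with the most frequent category
mode_filled = df["City"].fillna(df["City"].mode()[0])
```

Each method returns a new object, so the original df stays unchanged and the results can be compared side by side.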

2. Duplicate Data

Duplicate records are repeated rows in the dataset.

ID   Name
101  John
101  John

The second row is an exact duplicate of the first.

Why Duplicates are Problematic

Duplicates can:

  • Bias the model
  • Increase training time
  • Reduce data quality

Solution:

Remove duplicates with a function such as the pandas method:

drop_duplicates()
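A short sketch of how this might look on the table above. The third row (102, Alex) is a hypothetical addition so that something distinct remains after deduplication.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [101, 101, 102],
    "Name": ["John", "John", "Alex"],  # last row is hypothetical
})

# Count rows that repeat an earlier row exactly
duplicate_count = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)
```

By default drop_duplicates() compares all columns; its subset parameter can restrict the comparison to key columns such as ID when that is the intended definition of a duplicate.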

3. Inconsistent Data

Sometimes the same information is written in different formats.

Example:

Male
male
M
MALE

All represent the same category but are inconsistent.

Solution

Convert all values into a standard format.

Example:

Convert all values to lowercase, then map abbreviations such as "m" to the full label "male".
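Lowercasing on its own would still leave "m" and "male" as two separate categories, so a mapping step is usually needed as well. A minimal pandas sketch, assuming "male" is the chosen standard label:

```python
import pandas as pd

gender = pd.Series(["Male", "male", "M", "MALE"])

normalized = (
    gender.str.strip()            # drop stray whitespace
          .str.lower()            # "Male", "MALE" -> "male"; "M" -> "m"
          .replace({"m": "male"}) # map the abbreviation to the full label
)
```

After this step normalized holds a single category, which can be checked with normalized.nunique().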

4. Outliers

Outliers are unusually high or low values compared to the rest of the data.

Example:

20, 25, 30, 28, 500

Here, 500 is an outlier.

Why Outliers Matter

Outliers can:

  • Affect averages
  • Distort model learning
  • Reduce prediction accuracy

Methods to Handle Outliers

Techniques:

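  • Remove outliers
  • Use the IQR method (flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR)
  • Use the Z-score method (flag values more than a chosen number of standard deviations from the mean, commonly 3)
  • Apply transformations (for example, a log transform to compress very large values)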
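The IQR and Z-score methods can be sketched on the five values from the example using Python's standard statistics module. The quartile convention (method="inclusive") and the 1.5 × IQR fences are common choices assumed here, not something this section prescribes.

```python
import statistics

values = [20, 25, 30, 28, 500]

# IQR method: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in values if v < lower or v > upper]

# Z-score method: distance from the mean in standard deviations
mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation
z_scores = [(v - mean) / std for v in values]
```

Here the IQR fences are 17.5 and 37.5, so 500 is flagged. On a sample this small the Z-score of 500 is only about 2.0, so the common cutoff of 3 would not flag it; the Z-score method works better on larger datasets.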

5. Incorrect Data Formats

Data may be stored in incorrect formats.

Example:

Date
"12-05-2026"

Here the date is stored as text instead of in a date format.

Why This is a Problem

Incorrect formats can:

  • Cause processing errors
  • Affect analysis
  • Reduce model efficiency

Solution:

Convert data into the proper formats, for example by parsing date strings into a date type.
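A minimal pandas sketch of such a conversion. It assumes the text "12-05-2026" means day-month-year; the source does not say which order is intended, so the format string must be adjusted if the data is month-first. The second row is a hypothetical invalid entry.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["12-05-2026", "unknown"]})  # second row is hypothetical

# Parse text into datetimes; values that do not match become NaT
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y", errors="coerce")

parsed_month = df["Date"].dt.month.iloc[0]  # 5 under the day-month-year assumption
```

Using errors="coerce" turns unparseable values into missing values, which can then be handled with the missing-value methods described earlier.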

6. Noisy Data

Noisy data contains random errors or meaningless values.

Example:

Sensor errors producing unrealistic temperature values.

Solution

Noise can be reduced using:

  • Filtering techniques
  • Smoothing methods
  • Data validation

Steps in Data Cleaning

1. Identify problems
2. Handle missing values
3. Remove duplicates
4. Fix inconsistencies
5. Handle outliers
6. Correct formats
7. Validate cleaned data
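The seven steps can be strung together in one pass, as in the sketch below. The dataset, column names, and the specific choices (median imputation, 1.5 × IQR fences, day-month-year dates) are assumptions made for illustration; a real pipeline would pick them based on the data.

```python
import pandas as pd

# Hypothetical raw data containing each kind of problem
raw = pd.DataFrame({
    "Name":   ["John", "John", "Alex", "Sam", "Mia"],
    "Gender": ["Male", "Male", "M", "male", "FEMALE"],
    "Age":    [25, 25, None, 30, 400],
    "Joined": ["12-05-2026", "12-05-2026", "13-05-2026",
               "14-05-2026", "15-05-2026"],
})

# 1. Identify problems
missing_per_column = raw.isna().sum()
duplicate_rows = int(raw.duplicated().sum())

df = raw.copy()

# 2. Handle missing values (median is robust to the extreme age)
df["Age"] = df["Age"].fillna(df["Age"].median())

# 3. Remove duplicates
df = df.drop_duplicates()

# 4. Fix inconsistencies
df["Gender"] = df["Gender"].str.lower().replace({"m": "male", "f": "female"})

# 5. Handle outliers with the IQR rule
q1, q3 = df["Age"].quantile(0.25), df["Age"].quantile(0.75)
iqr = q3 - q1
df = df[df["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# 6. Correct formats (assumes day-month-year)
df["Joined"] = pd.to_datetime(df["Joined"], format="%d-%m-%Y")

# 7. Validate cleaned data
assert not df.isna().any().any()
assert not df.duplicated().any()
```

In this run one duplicate row and one outlier row (age 400) are removed, leaving three rows.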

Example — Cleaning House Price Data

Suppose we have:

Area  Price
1200  50L
NULL  60L
1500  500 Cr

Problems:

  • Missing area value
  • Prices stored as text with mixed units (L and Cr)
  • Outlier in price

Cleaning:

  • Fill missing area with average area
  • Convert prices to numbers in a single unit
  • Remove or handle outlier price
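A sketch of how this table might be cleaned with pandas. It assumes "L" means lakh and "Cr" means crore (1 crore = 100 lakh), which the table does not state. The helper price_in_lakh and the rule "more than 10 times the median price" are hypothetical; with only three rows, the IQR and Z-score methods are not reliable, so a simple ratio rule is used instead.

```python
import re
import pandas as pd

# Assumed unit meanings: L = lakh, Cr = crore = 100 lakh
UNIT_IN_LAKH = {"l": 1.0, "cr": 100.0}

def price_in_lakh(text: str) -> float:
    """Convert strings like '50L' or '500 Cr' to a number of lakh."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(l|cr)\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized price: {text!r}")
    amount, unit = match.groups()
    return float(amount) * UNIT_IN_LAKH[unit.lower()]

df = pd.DataFrame({
    "Area": [1200, None, 1500],
    "Price": ["50L", "60L", "500 Cr"],
})

# Fill the missing area with the average area: (1200 + 1500) / 2 = 1350
df["Area"] = df["Area"].fillna(df["Area"].mean())

# Put every price in the same numeric unit
df["PriceLakh"] = df["Price"].map(price_in_lakh)

# Flag prices far above the median (hypothetical threshold)
df["PriceOutlier"] = df["PriceLakh"] > 10 * df["PriceLakh"].median()
```

Flagging the outlier in a separate column, rather than deleting the row immediately, leaves room to check whether the value is a data-entry error or a genuinely expensive property.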

Python Example

import pandas as pd

df = pd.read_csv("data.csv")

# Remove duplicates
df = df.drop_duplicates()

# Fill missing values
df["Age"] = df["Age"].fillna(df["Age"].mean())

Benefits of Data Cleaning

  • Better data quality
  • Improved model accuracy
  • Faster model training
  • Reliable predictions
  • Better business decisions
