Data Cleaning Example

Data Cleaning Example Using Pandas

Let us understand Data Cleaning using a small sample dataset.

Sample Dataset

import pandas as pd

data = {
"Name": ["John", "Alex", "Sam", "John", "Ravi"],
"Age": [25, None, 30, 25, 200],
"Salary": [50000, 60000, None, 50000, 70000],
"Gender": ["Male", "male", "M", "Male", "MALE"]
}

df = pd.DataFrame(data)
print(df)

Output

   Name    Age   Salary Gender
0 John 25.0 50000.0 Male
1 Alex NaN 60000.0 male
2 Sam 30.0 NaN M
3 John 25.0 50000.0 Male
4 Ravi 200.0 70000.0 MALE

This dataset has:

  • Missing values

  • Duplicate records

  • Inconsistent gender values

  • Outlier in Age column

1. Handling Missing Values

df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

Here, missing values in Age and Salary are replaced with the average value of each column.

2. Removing Duplicate Records

df = df.drop_duplicates()

This removes repeated rows from the dataset.

3. Fixing Inconsistent Data

df["Gender"] = df["Gender"].replace({
"male": "Male",
"M": "Male",
"MALE": "Male"
})

Here, different forms of the same gender value are converted into one standard format.

4. Handling Outliers

In the Age column, value 200 is not realistic.

df = df[df["Age"] < 100]

This removes rows where age is greater than 100.

Complete Code

import pandas as pd

data = {
"Name": ["John", "Alex", "Sam", "John", "Ravi"],
"Age": [25, None, 30, 25, 200],
"Salary": [50000, 60000, None, 50000, 70000],
"Gender": ["Male", "male", "M", "Male", "MALE"]
}

df = pd.DataFrame(data)

print("Original Dataset:")
print(df)

# Fill missing values
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

# Remove duplicate rows
df = df.drop_duplicates()

# Standardize gender values
df["Gender"] = df["Gender"].replace({
"male": "Male",
"M": "Male",
"MALE": "Male"
})

# Remove age outliers
df = df[df["Age"] < 100]

print("\nCleaned Dataset:")
print(df)

Cleaned Dataset

   Name   Age   Salary Gender
0 John 25.0 50000.0 Male
1 Alex 70.0 60000.0 Male
2 Sam 30.0 57500.0 Male

Summary

In this example, we cleaned the dataset by filling missing values, removing duplicate records, fixing inconsistent values, and removing outliers. This cleaned data is now more reliable and suitable for machine learning model training.

Previous Topic Data Cleaning