Data Cleaning Example
Data Cleaning Example Using Pandas
Let us understand Data Cleaning using a small sample dataset.
Sample Dataset
import pandas as pd
data = {
"Name": ["John", "Alex", "Sam", "John", "Ravi"],
"Age": [25, None, 30, 25, 200],
"Salary": [50000, 60000, None, 50000, 70000],
"Gender": ["Male", "male", "M", "Male", "MALE"]
}
df = pd.DataFrame(data)
print(df)
Output
Name Age Salary Gender
0 John 25.0 50000.0 Male
1 Alex NaN 60000.0 male
2 Sam 30.0 NaN M
3 John 25.0 50000.0 Male
4 Ravi 200.0 70000.0 MALE
This dataset has:
-
Missing values
-
Duplicate records
-
Inconsistent gender values
-
Outlier in Age column
1. Handling Missing Values
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())
Here, missing values in Age and Salary are replaced with the average value of each column.
2. Removing Duplicate Records
df = df.drop_duplicates()
This removes repeated rows from the dataset.
3. Fixing Inconsistent Data
df["Gender"] = df["Gender"].replace({
"male": "Male",
"M": "Male",
"MALE": "Male"
})
Here, different forms of the same gender value are converted into one standard format.
4. Handling Outliers
In the Age column, value 200 is not realistic.
df = df[df["Age"] < 100]
This removes rows where age is greater than 100.
Complete Code
import pandas as pd
data = {
"Name": ["John", "Alex", "Sam", "John", "Ravi"],
"Age": [25, None, 30, 25, 200],
"Salary": [50000, 60000, None, 50000, 70000],
"Gender": ["Male", "male", "M", "Male", "MALE"]
}
df = pd.DataFrame(data)
print("Original Dataset:")
print(df)
# Fill missing values
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())
# Remove duplicate rows
df = df.drop_duplicates()
# Standardize gender values
df["Gender"] = df["Gender"].replace({
"male": "Male",
"M": "Male",
"MALE": "Male"
})
# Remove age outliers
df = df[df["Age"] < 100]
print("\nCleaned Dataset:")
print(df)
Cleaned Dataset
Name Age Salary Gender
0 John 25.0 50000.0 Male
1 Alex 70.0 60000.0 Male
2 Sam 30.0 57500.0 Male
Summary
In this example, we cleaned the dataset by filling missing values, removing duplicate records, fixing inconsistent values, and removing outliers. This cleaned data is now more reliable and suitable for machine learning model training.