Example 1
EDA in Python: A Practical Example Using a Small Dataset
This example walks through exploratory data analysis (EDA) in Python on a small employee dataset. It shows how to inspect, summarize, and visualize the data step by step using Pandas, Matplotlib, and Seaborn.
Step 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Create Sample Dataset
data = {
    "Name": ["John", "Alex", "Sam", "Ravi", "Anu"],
    "Age": [22, 25, 30, 35, 28],
    "Salary": [30000, 40000, 50000, 60000, 45000],
    "Experience": [1, 2, 5, 7, 3],
    "Department": ["HR", "IT", "IT", "Finance", "HR"]
}
df = pd.DataFrame(data)
print(df)
Output
   Name  Age  Salary  Experience  Department
0  John   22   30000           1          HR
1  Alex   25   40000           2          IT
2   Sam   30   50000           5          IT
3  Ravi   35   60000           7     Finance
4   Anu   28   45000           3          HR
Step 3: Understanding the Dataset
View First Rows
print(df.head())
Dataset Shape
print(df.shape)
Output
(5, 5)
Dataset Information
df.info()  # info() prints its report itself and returns None, so no print() is needed
Statistical Summary
print(df.describe())
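By default, describe() reports only on the numerical columns. As a minimal sketch, the figures it shows can be reproduced one at a time with agg(), and calling describe() on just the text columns gives their count, number of unique values, most frequent value, and its frequency. The names salary_stats and text_summary are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alex", "Sam", "Ravi", "Anu"],
    "Age": [22, 25, 30, 35, 28],
    "Salary": [30000, 40000, 50000, 60000, 45000],
    "Experience": [1, 2, 5, 7, 3],
    "Department": ["HR", "IT", "IT", "Finance", "HR"],
})

# Reproduce a few of the numbers that describe() reports for Salary
salary_stats = df["Salary"].agg(["mean", "median", "min", "max"])
print(salary_stats)

# Summarize the text columns (count, unique, top, freq)
text_summary = df[["Name", "Department"]].describe()
print(text_summary)
```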
Step 4: Checking Missing Values
print(df.isnull().sum())
Output
Name          0
Age           0
Salary        0
Experience    0
Department    0
dtype: int64
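The sample dataset has no missing values, so the check above returns all zeros. To see what a non-zero result looks like, the sketch below blanks out one salary on a copy of the data (a hypothetical scenario, not part of the original dataset) and then fills the gap with the median. Median imputation is only one possible choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Alex", "Sam", "Ravi", "Anu"],
    "Salary": [30000, 40000, 50000, 60000, 45000],
})

# Hypothetical scenario: work on a float copy and blank out one salary
df_missing = df.astype({"Salary": "float64"})
df_missing.loc[2, "Salary"] = np.nan

# Count missing values per column; Salary now reports one
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# One simple option: fill the gap with the median of the remaining salaries
median_salary = df_missing["Salary"].median()
df_filled = df_missing.fillna({"Salary": median_salary})
print(df_filled)
```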
Step 5: Histogram
A histogram shows how the values of a numerical variable are distributed by counting how many of them fall into each bin.
Example — Age Distribution
plt.hist(df["Age"])
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution")
plt.show()
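To make the bar heights concrete, the counts behind a histogram can be computed without drawing anything. The sketch below uses np.histogram with three equal-width bins; the choice of three bins is an assumption made for this tiny dataset, not a value from the original example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22, 25, 30, 35, 28], name="Age")

# Split the age range into 3 equal-width bins and count the values in each
counts, bin_edges = np.histogram(ages, bins=3)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```

Passing the same bins=3 to plt.hist draws bars with these heights.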
Step 6: Scatter Plot
Scatter plots are used to analyze relationships between two numerical variables.
Example — Experience vs Salary
plt.scatter(df["Experience"], df["Salary"])
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Experience vs Salary")
plt.show()
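A scatter plot suggests a trend visually; a straight-line fit puts a rough number on it. The sketch below uses np.polyfit, which is an addition for illustration and not part of the original example. With only five points, the fitted line is a descriptive aid and not a reliable model.

```python
import numpy as np

experience = np.array([1, 2, 5, 7, 3])
salary = np.array([30000, 40000, 50000, 60000, 45000])

# Least-squares fit of: salary = slope * experience + intercept
slope, intercept = np.polyfit(experience, salary, deg=1)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")

# Predicted salaries along the fitted line, e.g. for overlaying with plt.plot
fitted_salary = slope * experience + intercept
```

A positive slope here means salary tends to rise with experience in this sample.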
Step 7: Box Plot
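A box plot summarizes the spread of a variable through its median and quartiles, and marks values that lie far outside the quartiles as potential outliers.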
Example — Salary Distribution
sns.boxplot(x=df["Salary"])
plt.title("Salary Box Plot")
plt.show()
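The quantities a box plot draws can also be computed directly. This sketch applies the 1.5 × IQR rule, which matches Seaborn's default whisker setting, to flag potential outliers. The variable names are illustrative.

```python
import pandas as pd

salary = pd.Series([30000, 40000, 50000, 60000, 45000], name="Salary")

# Quartiles and interquartile range (the edges and height of the box)
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box are treated as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salary[(salary < lower) | (salary > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Outlier bounds: {lower} to {upper}")
print(f"Outliers found: {len(outliers)}")
```

For this dataset every salary falls inside the bounds, so the box plot shows no outlier markers.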
Step 8: Count Plot
Count plots display the frequency of categorical values.
Example — Department Count
sns.countplot(x=df["Department"])
plt.title("Department Count")
plt.show()
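The numbers behind a count plot are available from value_counts(). As a small sketch, the same call with normalize=True gives each department's share of the rows instead of the raw count.

```python
import pandas as pd

departments = pd.Series(["HR", "IT", "IT", "Finance", "HR"], name="Department")

# Number of employees per department (the bar heights of the count plot)
dept_counts = departments.value_counts()
print(dept_counts)

# Same information as a fraction of all rows
dept_share = departments.value_counts(normalize=True)
print(dept_share)
```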
Step 9: Correlation Analysis
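Correlation measures the strength and direction of the linear relationship between two numerical features, on a scale from -1 to 1. The numeric_only=True argument restricts the calculation to numerical columns, so Name and Department are skipped.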
Example
print(df.corr(numeric_only=True))
Heatmap Visualization
sns.heatmap(df.corr(numeric_only=True),
            annot=True,
            cmap="Blues")
plt.title("Correlation Heatmap")
plt.show()
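To show where one cell of the correlation matrix comes from, the sketch below computes the Pearson coefficient for Experience and Salary by hand and compares it with the Pandas result. The names r_manual and r_pandas are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Experience": [1, 2, 5, 7, 3],
    "Salary": [30000, 40000, 50000, 60000, 45000],
})

# Deviations of each value from its column mean
x = df["Experience"] - df["Experience"].mean()
y = df["Salary"] - df["Salary"].mean()

# Pearson r: sum of co-deviations divided by the product of their magnitudes
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same coefficient as computed by Pandas
r_pandas = df["Experience"].corr(df["Salary"])

print(round(r_manual, 4), round(r_pandas, 4))
```

Both values agree and are close to 1, indicating a strong positive linear relationship in this small sample.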
Complete Program
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Sample Dataset
data = {
    "Name": ["John", "Alex", "Sam", "Ravi", "Anu"],
    "Age": [22, 25, 30, 35, 28],
    "Salary": [30000, 40000, 50000, 60000, 45000],
    "Experience": [1, 2, 5, 7, 3],
    "Department": ["HR", "IT", "IT", "Finance", "HR"]
}
df = pd.DataFrame(data)
# Dataset Information
print(df.head())
df.info()  # prints its report itself and returns None
print(df.describe())
# Missing Values
print(df.isnull().sum())
# Histogram
plt.hist(df["Age"])
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
# Scatter Plot
plt.scatter(df["Experience"], df["Salary"])
plt.title("Experience vs Salary")
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.show()
# Box Plot
sns.boxplot(x=df["Salary"])
plt.title("Salary Box Plot")
plt.show()
# Count Plot
sns.countplot(x=df["Department"])
plt.title("Department Count")
plt.show()
# Correlation Heatmap
sns.heatmap(df.corr(numeric_only=True),
            annot=True,
            cmap="Blues")
plt.title("Correlation Heatmap")
plt.show()
Summary
In this example, we performed EDA using Python libraries such as Pandas, Matplotlib, and Seaborn. The dataset was analyzed using statistical summaries and visualizations including histograms, scatter plots, box plots, count plots, and heatmaps to better understand the structure and relationships within the data.