EDA in Python
Exploratory data analysis (EDA) in Python means analyzing and visualizing a dataset with Python libraries to understand its structure, patterns, relationships, and quality before building machine learning models.
Python provides powerful libraries that make data analysis and visualization simple and efficient.
Why Use Python for EDA
Python is widely used for EDA because:
- Easy to learn
- Large ecosystem of libraries
- Powerful visualization tools
- Efficient data handling
- Strong community support
Common Python Libraries for EDA
| Library | Purpose |
|---|---|
| Pandas | Data analysis |
| NumPy | Numerical operations |
| Matplotlib | Basic plotting |
| Seaborn | Statistical visualization |
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading Dataset
Datasets are commonly loaded using Pandas.
Example
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())
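As a minimal sketch of the loading step that runs without a file on disk, the CSV text below stands in for `data.csv`. The column names and values are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "data.csv"
csv_text = """Age,Salary,Experience
22,30000,1
25,40000,2
30,50000,4
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

With a real file, passing the path as in `pd.read_csv("data.csv")` works the same way.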
Understanding Dataset Structure
1. Viewing First Rows
print(df.head())
Displays the first 5 rows of the dataset by default. Pass a number, such as df.head(10), to show more.
2. Viewing Last Rows
print(df.tail())
Displays the last 5 rows by default.
3. Dataset Shape
print(df.shape)
Shows:
(rows, columns)
4. Dataset Information
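# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
Provides:
- Column names
- Data types
- Non-null counts per column (a count lower than the number of rows means missing values)
- Memory usage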
5. Statistical Summary
print(df.describe())
Displays, for each numerical column:
- Count of non-null values
- Mean
- Standard deviation
- Minimum and maximum values
- Quartiles (the 50% row is the median)
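The summary can also be queried row by row. The sketch below reuses the small Age and Salary values from the complete example later in this article and shows that the median sits in the row labelled `50%`.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
})

summary = df.describe()

# describe() has no row called "median"; the 50th percentile is the median
median_from_summary = summary.loc["50%"]
median_direct = df.median()

print(summary)
print(median_from_summary)
```

Calling `df.median()` directly gives the same values when only the median is needed.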
Checking Missing Values
Missing values are common in datasets.
Example
print(df.isnull().sum())
Shows the number of missing values in each column.
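Raw counts are easier to judge next to the share of rows they represent. The following sketch, on a hypothetical DataFrame with a few gaps, reports both per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberately missing entries
df = pd.DataFrame({
    "Age": [22, np.nan, 30, 35, np.nan],
    "Salary": [30000, 40000, np.nan, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
})

missing_count = df.isnull().sum()
# The mean of a boolean column is the fraction of True values
missing_pct = df.isnull().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report)
```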
Detecting Duplicate Records
print(df.duplicated().sum())
Counts duplicate rows.
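Once duplicates are counted, it usually helps to look at them before removing them. A small sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical
df = pd.DataFrame({
    "Age": [22, 25, 25, 30],
    "Salary": [30000, 40000, 40000, 50000],
})

duplicate_count = df.duplicated().sum()

# keep=False marks every row that takes part in a duplicate, not just the repeats
print(df[df.duplicated(keep=False)])

# Keep the first occurrence of each row and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```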
Univariate Analysis in Python
Analyzing one variable at a time.
Histogram
plt.hist(df["Age"])
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution")
plt.show()
What We Learn
- Distribution of values
- Most common range
- Skewness
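What the histogram shows visually can also be checked numerically. The sketch below uses hypothetical ages and hand-picked bin edges to compute the bin counts that a histogram would draw, along with the sample skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one value far to the right
ages = pd.Series([22, 25, 30, 35, 40, 41, 65], name="Age")

# Bin counts for the chosen edges; each bin includes its left edge,
# and the last bin also includes its right edge
counts, edges = np.histogram(ages, bins=[20, 30, 40, 50, 60, 70])
print(counts)

# Positive skew means a longer tail on the right, negative on the left
skewness = ages.skew()
print(skewness)
```

Passing the same `bins` list to `plt.hist` draws bars with these heights.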
Box Plot
sns.boxplot(x=df["Salary"])
plt.show()
What We Learn
- Outliers
- Spread of data
- Median values
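With default settings, a box plot draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A sketch of the same rule in plain Pandas, on hypothetical salaries:

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salary = pd.Series([30000, 40000, 50000, 60000, 70000, 250000], name="Salary")

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = salary[(salary < lower) | (salary > upper)]
print(outliers)
```

The 1.5 multiplier is a convention rather than a strict threshold, so flagged values deserve a closer look before being dropped.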
Bivariate Analysis in Python
Analyzing relationships between two variables.
Scatter Plot
plt.scatter(df["Experience"], df["Salary"])
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.show()
What We Learn
- Positive correlation
- Negative correlation
- Trends
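The direction seen in a scatter plot can be summarized with one number. This sketch reuses the Experience and Salary values from the complete example later in this article and computes the Pearson correlation coefficient.

```python
import pandas as pd

df = pd.DataFrame({
    "Experience": [1, 2, 4, 6, 8],
    "Salary": [30000, 40000, 50000, 60000, 70000],
})

# Series.corr uses Pearson correlation by default; values near +1 indicate
# a strong positive linear relationship, values near -1 a strong negative one
r = df["Experience"].corr(df["Salary"])
print(round(r, 3))
```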
Correlation Matrix
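# numeric_only=True skips text columns such as Gender (parameter available in pandas 1.5+);
# without it, pandas 2.0+ raises an error when the DataFrame has non-numeric columns
print(df.corr(numeric_only=True))
Shows pairwise correlation coefficients (Pearson by default) between numerical variables.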
Heatmap Visualization
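# numeric_only=True keeps text columns out of the correlation matrix (pandas 1.5+)
sns.heatmap(df.corr(numeric_only=True), annot=True)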
plt.show()
What We Learn
- Strong relationships
- Weak relationships
- Feature dependencies
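When a heatmap has many cells, listing the feature pairs by strength can be easier to read. The sketch below assumes pandas 1.5 or later for `numeric_only`; the `Department` column is hypothetical and is included only to show that text columns are left out.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
    "Department": ["HR", "HR", "IT", "IT", "Sales"],  # hypothetical text column
})

corr = df.corr(numeric_only=True)

# Collect each unordered pair of numeric columns once
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Rank by absolute value so strong negative correlations are not missed
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)
for (a, b), value in ranked:
    print(f"{a} vs {b}: {value:.3f}")
```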
Count Plot
Used for categorical data analysis.
Example
sns.countplot(x=df["Gender"])
plt.show()
What We Learn
- Frequency of categories
- Class imbalance
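The bar heights of a count plot are the category frequencies, which Pandas can report directly. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical categorical column
gender = pd.Series(["F", "M", "M", "M", "F", "M", "M", "M"], name="Gender")

counts = gender.value_counts()

# normalize=True returns proportions, which makes imbalance easy to spot
shares = gender.value_counts(normalize=True)

print(counts)
print(shares)
```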
Pair Plot
Visualizes relationships among multiple variables.
Example
sns.pairplot(df)
plt.show()
What We Learn
- Feature interactions
- Trends
- Clusters
- Correlations
Distribution Plot
sns.histplot(df["Salary"], kde=True)
plt.show()
What We Learn
- Data distribution
- Density estimation
- Skewness
Real-World Example
Employee Dataset Analysis
Suppose a company dataset contains:
- Age
- Salary
- Experience
- Department
EDA in Python helps answer:
- Which department has the highest salaries?
- Does experience affect salary?
- Are there salary outliers?
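One way these three questions might be answered in Pandas is sketched below. The rows, including the department names, are hypothetical, and the outlier check assumes the common 1.5 × IQR rule.

```python
import pandas as pd

# Hypothetical employee records
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
    "Department": ["HR", "HR", "IT", "IT", "Sales"],
})

# Which department has the highest salaries?
mean_salary = df.groupby("Department")["Salary"].mean().sort_values(ascending=False)
top_department = mean_salary.idxmax()

# Is experience related to salary? Correlation shows association, not causation.
r = df["Experience"].corr(df["Salary"])

# Are there salary outliers? (1.5 * IQR rule)
q1, q3 = df["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["Salary"] < q1 - 1.5 * iqr) | (df["Salary"] > q3 + 1.5 * iqr)

print(mean_salary)
print(top_department, round(r, 3), int(is_outlier.sum()))
```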
Complete EDA Example in Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
}
df = pd.DataFrame(data)
# Dataset summary
df.info()  # prints its report directly and returns None
print(df.describe())
# Histogram
plt.hist(df["Age"])
plt.title("Age Distribution")
plt.show()
# Scatter Plot
plt.scatter(df["Experience"], df["Salary"])
plt.title("Experience vs Salary")
plt.show()
# Heatmap
sns.heatmap(df.corr(), annot=True)
plt.show()
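The histogram and scatter plot from the example can also be placed side by side in one figure. The sketch below is one possible layout and rests on two assumptions: it selects the non-interactive `Agg` backend so it runs without a display, and it writes to a hypothetical file name, `eda_overview.png`.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; omit in a notebook or desktop session

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
})

# One row, two columns of axes in a single figure
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

ax_hist.hist(df["Age"], bins=5)
ax_hist.set_title("Age Distribution")
ax_hist.set_xlabel("Age")
ax_hist.set_ylabel("Frequency")

ax_scatter.scatter(df["Experience"], df["Salary"])
ax_scatter.set_title("Experience vs Salary")
ax_scatter.set_xlabel("Experience")
ax_scatter.set_ylabel("Salary")

fig.tight_layout()
fig.savefig("eda_overview.png")  # hypothetical output file name
```

In an interactive session, replacing the `savefig` call with `plt.show()` displays the figure instead.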
Benefits of EDA in Python
- Easy data analysis
- Powerful visualizations
- Faster understanding of datasets
- Better feature selection
- Improved preprocessing decisions
Important Points
1. Pandas is mainly used for data manipulation and analysis.
2. Matplotlib and Seaborn are used for data visualization.
3. Heatmaps help identify correlations between variables.
4. Box plots are useful for detecting outliers.
5. EDA helps understand datasets before machine learning model training.
Summary
EDA in Python uses libraries such as Pandas, NumPy, Matplotlib, and Seaborn to analyze and visualize datasets. It helps identify missing values, outliers, distributions, correlations, and trends, making it easier to prepare data for machine learning models.