EDA in Python

Exploratory data analysis (EDA) in Python involves analyzing and visualizing datasets with Python libraries to understand the structure, patterns, relationships, and quality of the data before building machine learning models.

Python provides powerful libraries that make data analysis and visualization simple and efficient.

Why Use Python for EDA

Python is widely used for EDA because:

  • Easy to learn

  • Large ecosystem of libraries

  • Powerful visualization tools

  • Efficient data handling

  • Strong community support

Common Python Libraries for EDA

Library      Purpose
Pandas       Data analysis
NumPy        Numerical operations
Matplotlib   Basic plotting
Seaborn      Statistical visualization

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading Dataset

Datasets are commonly loaded using Pandas.

Example

import pandas as pd

df = pd.read_csv("data.csv")

print(df.head())
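The loading step can be tried without a file on disk. The sketch below is only an illustration: the CSV text is made-up sample data standing in for "data.csv", and it relies on read_csv accepting a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the file "data.csv"
csv_text = """Age,Salary,Experience
22,30000,1
25,40000,2
30,50000,4
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))   # first two rows only
print(df.shape)     # (rows, columns)
```

With a real file, the io.StringIO wrapper is replaced by the path string, as in the example above.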

Understanding Dataset Structure

1. Viewing First Rows

print(df.head())

Displays the first 5 rows of the dataset by default. Pass a number, for example df.head(10), to show a different number of rows.

2. Viewing Last Rows

print(df.tail())

Displays the last 5 rows.

3. Dataset Shape

print(df.shape)

Shows:

(rows, columns)

4. Dataset Information

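df.info()

df.info() prints its report directly and returns None, so it does not need to be wrapped in print(). The report provides:

  • Column names

  • Data types

  • Non-null counts per column, which reveal missing values

  • Memory usage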

5. Statistical Summary

print(df.describe())

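Displays, for each numeric column:

  • Count of non-missing values

  • Mean

  • Standard deviation

  • Minimum and maximum values

  • 25th, 50th, and 75th percentiles (the 50% row is the median)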
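As a small sketch of how to read the summary, the snippet below builds a DataFrame from illustrative sample values (assumed here, not taken from a real dataset) and pulls individual statistics out of the describe() result. It also shows that the median appears under the "50%" label.

```python
import pandas as pd

# Illustrative sample values (an assumption for this sketch)
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
})

summary = df.describe()

# Rows of the summary are statistics, columns are the numeric variables
print(summary.loc[["mean", "50%", "std"]])

# The "50%" row is the median of each column
median_age = df["Age"].median()
print(summary.loc["50%", "Age"] == median_age)
```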

Checking Missing Values

Missing values are common in datasets.

Example

print(df.isnull().sum())

Shows the number of missing values in each column.
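Raw counts are easier to judge next to percentages. The following is a minimal sketch on made-up data: taking the mean of the boolean mask returned by isnull() gives the fraction of missing values in each column.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberately missing entries
df = pd.DataFrame({
    "Age": [22, np.nan, 30, 35],
    "Salary": [30000, 40000, np.nan, np.nan],
})

missing_count = df.isnull().sum()        # missing values per column
missing_pct = df.isnull().mean() * 100   # share of rows missing, in percent

report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(report)
```

Columns with a high percentage may need imputation or removal, a decision that depends on the dataset.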

Detecting Duplicate Records

print(df.duplicated().sum())

Counts the rows that exactly repeat an earlier row.
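Once duplicates are counted, it often helps to look at them and then drop them. This sketch uses invented rows; keep=False and drop_duplicates() are standard pandas options, but whether duplicates should be removed at all depends on the data.

```python
import pandas as pd

# Invented rows; the third row repeats the first
df = pd.DataFrame({
    "Age": [22, 25, 22, 30],
    "Salary": [30000, 40000, 30000, 50000],
})

# Rows identical to an earlier row
n_duplicates = df.duplicated().sum()

# keep=False marks every copy, which makes the duplicates easy to inspect
duplicate_rows = df[df.duplicated(keep=False)]
print(duplicate_rows)

# Keep the first occurrence of each row and renumber the index
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```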

Univariate Analysis in Python

Univariate analysis examines one variable at a time.

Histogram

plt.hist(df["Age"])

plt.xlabel("Age")
plt.ylabel("Frequency")
plt.title("Age Distribution")

plt.show()

What We Learn

  • Distribution of values

  • Most common range

  • Skewness
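The same observations can be checked numerically. The sketch below, on invented ages, computes the bin counts that a histogram would draw (the choice of 4 bins is arbitrary) and the sample skewness. A mean well above the median is another sign of a right-skewed distribution.

```python
import numpy as np
import pandas as pd

# Invented ages with one unusually high value
ages = pd.Series([22, 23, 24, 25, 26, 27, 60], name="Age")

# Bin counts and bin edges; 4 bins is an arbitrary choice for this sketch
counts, edges = np.histogram(ages, bins=4)
print(counts)
print(edges)

# Positive skewness indicates a longer tail on the right
skewness = ages.skew()
print(skewness)
print(ages.mean(), ages.median())
```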

Box Plot

sns.boxplot(x=df["Salary"])

plt.show()

What We Learn

  • Outliers

  • Spread of data

  • Median values
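To list the points a box plot would flag, the conventional 1.5 × IQR rule can be applied by hand. This is a sketch on invented salaries; the factor 1.5 matches the default whisker length in Seaborn's boxplot, but other cutoffs are also used in practice.

```python
import pandas as pd

# Invented salaries with one extreme value
salary = pd.Series([30000, 40000, 50000, 60000, 70000, 250000], name="Salary")

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1                 # interquartile range: spread of the middle 50%

lower = q1 - 1.5 * iqr        # values below this are flagged
upper = q3 + 1.5 * iqr        # values above this are flagged

outliers = salary[(salary < lower) | (salary > upper)]
print(outliers)
```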

Bivariate Analysis in Python

Bivariate analysis examines the relationship between two variables.

Scatter Plot

plt.scatter(df["Experience"], df["Salary"])

plt.xlabel("Experience")
plt.ylabel("Salary")

plt.show()

What We Learn

  • Positive correlation

  • Negative correlation

  • Trends

Correlation Matrix

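print(df.corr(numeric_only=True))

Shows the pairwise correlation between numerical variables. Passing numeric_only=True (available in pandas 1.5 and later) skips text columns such as Gender. In pandas 2.0 and later, calling df.corr() on a DataFrame that contains non-numeric columns raises an error.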
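A short sketch of the correlation step on a mixed DataFrame follows. The values, including the Gender column, are invented for illustration. Selecting numeric columns with select_dtypes is one way to keep text columns out of the calculation, and it works across pandas versions.

```python
import pandas as pd

# Invented data; Gender is a text column that cannot be correlated directly
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
    "Gender": ["F", "M", "F", "M", "F"],
})

# Keep only numeric columns, then compute Pearson correlations
corr = df.select_dtypes(include="number").corr()
print(corr)

# Correlation between a single pair of columns
r = df["Experience"].corr(df["Salary"])
print(r)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.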

Heatmap Visualization

sns.heatmap(df.corr(numeric_only=True), annot=True)

plt.show()

What We Learn

  • Strong relationships

  • Weak relationships

  • Feature dependencies
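What the heatmap shows visually can also be listed as a table of variable pairs. In the sketch below, the data (including the Commute column) is invented, and the 0.8 cutoff for a "strong" relationship is an arbitrary assumption chosen for illustration, not a standard threshold.

```python
from itertools import combinations

import pandas as pd

# Invented data; Commute is deliberately unrelated to the other columns
df = pd.DataFrame({
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8],
    "Commute": [30, 10, 25, 15, 20],
})

corr = df.corr()

# One row per unordered pair of variables
rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
pairs = pd.DataFrame(rows, columns=["var_1", "var_2", "r"])

threshold = 0.8  # assumed cutoff for this sketch
strong = pairs[pairs["r"].abs() >= threshold]
weak = pairs[pairs["r"].abs() < threshold]

print(strong)
print(weak)
```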

Count Plot

Used for categorical data analysis.

Example

sns.countplot(x=df["Gender"])

plt.show()

What We Learn

  • Frequency of categories

  • Class imbalance
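The numbers behind a count plot come from value_counts(). A minimal sketch on invented labels: normalize=True turns counts into proportions, which makes class imbalance easy to spot.

```python
import pandas as pd

# Invented category labels
gender = pd.Series(["M", "M", "M", "F", "M", "F", "M", "M"], name="Gender")

counts = gender.value_counts()                 # frequency of each category
shares = gender.value_counts(normalize=True)   # proportion of each category

print(counts)
print(shares)
```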

Pair Plot

Visualizes relationships among multiple variables.

Example

sns.pairplot(df)

plt.show()

What We Learn

  • Feature interactions

  • Trends

  • Clusters

  • Correlations

Distribution Plot

sns.histplot(df["Salary"], kde=True)

plt.show()

What We Learn

  • Data distribution

  • Density estimation

  • Skewness

Real-World Example

Employee Dataset Analysis

Suppose a company dataset contains:

  • Age

  • Salary

  • Experience

  • Department

EDA in Python helps answer:

  • Which department has the highest salaries?

  • Does experience affect salary?

  • Are there salary outliers?
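The first two questions can be sketched with a few lines of pandas. The rows below are invented, and comparing departments by average salary is one assumed reading of "highest salaries"; the median or maximum could be used instead.

```python
import pandas as pd

# Invented employee records
df = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "Sales"],
    "Salary": [60000, 70000, 40000, 50000, 30000],
    "Experience": [6, 8, 2, 4, 1],
})

# Average salary per department, highest first
avg_salary = (
    df.groupby("Department")["Salary"]
    .mean()
    .sort_values(ascending=False)
)
top_department = avg_salary.idxmax()
print(avg_salary)
print(top_department)

# Does experience move together with salary?
exp_salary_corr = df["Experience"].corr(df["Salary"])
print(exp_salary_corr)
```

A correlation close to 1 suggests that salary rises with experience in this sample; it does not by itself establish that experience causes higher pay.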

Complete EDA Example in Python

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = {
    "Age": [22, 25, 30, 35, 40],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Experience": [1, 2, 4, 6, 8]
}

df = pd.DataFrame(data)

# Dataset summary
df.info()  # prints its own report and returns None
print(df.describe())

# Histogram
plt.hist(df["Age"])
plt.title("Age Distribution")
plt.show()

# Scatter Plot
plt.scatter(df["Experience"], df["Salary"])
plt.title("Experience vs Salary")
plt.show()

# Heatmap
sns.heatmap(df.corr(), annot=True)
plt.show()

Benefits of EDA in Python

  • Easy data analysis

  • Powerful visualizations

  • Faster understanding of datasets

  • Better feature selection

  • Improved preprocessing decisions

Important Points

1. Pandas is mainly used for data manipulation and analysis.

2. Matplotlib and Seaborn are used for data visualization.

3. Heatmaps help identify correlations between variables.

4. Box plots are useful for detecting outliers.

5. EDA helps understand datasets before machine learning model training.

Summary

EDA in Python uses libraries such as Pandas, NumPy, Matplotlib, and Seaborn to analyze and visualize datasets. It helps identify missing values, outliers, distributions, correlations, and trends, making it easier to prepare data for machine learning models.
