Project 2_2
Project 2_2: Sales Data Visualization & EDA using Python
Exploratory Data Analysis (EDA) on Real Sales Dataset
Objective
In this project, we will perform Exploratory Data Analysis (EDA) on a real-world sales dataset.
We will learn:
-
how to load real datasets,
-
understand business sales data,
-
clean data,
-
analyze trends,
-
and visualize important business insights using Python.
This is exactly how data analysts and ML engineers start real projects in industries.
Cell 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
We are importing:
-
pandas → used for handling datasets
-
matplotlib.pyplot → used for graphs and visualization
These are the most commonly used libraries in Data Analysis and Machine Learning.
Cell 2: Load the Dataset
df = pd.read_csv("sales_data_sample.csv", encoding="latin1")
Here we load the CSV dataset using:
pd.read_csv()
-
sales_data_sample.csv → dataset file name
-
encoding="latin1" → helps avoid text reading issues in some datasets
After loading, the dataset is stored in variable df.
Cell 3: Display First 5 Rows
df.head()
head() displays the first 5 records.
This helps us:
-
verify dataset loading,
-
understand column names,
-
and observe sample values.
Cell 4: Check Dataset Shape
df.shape
shape gives:
(rows, columns)
Example:
(2823, 25)
Meaning:
-
2823 rows
-
25 columns
This tells us dataset size.
Cell 5: Check Dataset Information
df.info()
info() gives:
-
column names
-
number of non-null values
-
data types
-
memory usage
This helps identify:
-
missing values
-
incorrect data types
Cell 6: Statistical Summary
df.describe()
describe() gives statistical details for numerical columns.
It shows:
-
mean
-
minimum value
-
maximum value
-
standard deviation
-
percentiles
This helps understand sales distribution.
Cell 7: Check Missing Values
df.isnull().sum()
This checks missing values in each column.
Missing values are common in real-world datasets.
Before analysis or ML:
-
we must identify missing data,
-
and later clean or handle it.
Cell 8: View Column Names
df.columns
This displays all dataset columns.
Typical columns in sales datasets:
-
ORDERNUMBER
-
QUANTITYORDERED
-
PRICEEACH
-
SALES
-
PRODUCTLINE
-
CUSTOMERNAME
-
COUNTRY
-
DEALSIZE
Understanding columns is very important before analysis.
Cell 9: Select Important Columns
sales_df = df[[
"ORDERNUMBER",
"SALES",
"PRODUCTLINE",
"COUNTRY",
"YEAR_ID",
"MONTH_ID",
"DEALSIZE"
]]
sales_df.head()
Real datasets contain many columns.
For beginner analysis, we select only useful columns.
This makes the project:
-
easier,
-
cleaner,
-
and more understandable.
Cell 10: Total Sales
total_sales = sales_df["SALES"].sum()
print("Total Sales:", total_sales)
sum() adds all sales values.
This gives total business revenue.
This is one of the most basic business metrics.
Cell 11: Average Sales
average_sales = sales_df["SALES"].mean()
print("Average Sales:", average_sales)
mean() calculates average sales value.
This helps understand:
-
average customer purchase,
-
or average order amount.
Cell 12: Highest Sales Order
highest_sale = sales_df[sales_df["SALES"] == sales_df["SALES"].max()]
highest_sale
Here we identify the highest sales transaction.
max() gives the highest sales value.
Then we filter the dataset to show that record.
Cell 13: Lowest Sales Order
lowest_sale = sales_df[sales_df["SALES"] == sales_df["SALES"].min()]
lowest_sale
This identifies the lowest sales transaction.
Businesses often analyze:
-
highest-performing orders,
-
and weakest-performing orders.
Cell 14: Sales by Product Line
product_sales = sales_df.groupby("PRODUCTLINE")["SALES"].sum()
print(product_sales)
groupby() groups data based on categories.
Here we group by:
-
PRODUCTLINE
Then calculate:
-
total sales for each category
This helps identify best-selling product types.
Cell 15: Product Line Bar Chart
product_sales.plot(kind="bar")
plt.xlabel("Product Line")
plt.ylabel("Total Sales")
plt.title("Sales by Product Line")
plt.show()
This bar chart compares sales across product categories.
Businesses use this to:
-
identify profitable products,
-
improve inventory planning,
-
increase marketing focus.
Cell 16: Country-wise Sales Analysis
country_sales = sales_df.groupby("COUNTRY")["SALES"].sum()
country_sales.sort_values(ascending=False).head(10)
This calculates sales country-wise.
Then:
-
sorts sales in descending order,
-
displays top 10 countries.
This helps businesses identify strong markets.
Cell 17: Top Countries Sales Chart
country_sales.sort_values(ascending=False).head(10).plot(kind="bar")
plt.xlabel("Country")
plt.ylabel("Sales")
plt.title("Top 10 Countries by Sales")
plt.show()
This chart visualizes top-performing countries.
Businesses use this to:
-
focus regional marketing,
-
improve international sales strategies.
Cell 18: Monthly Sales Trend
monthly_sales = sales_df.groupby("MONTH_ID")["SALES"].sum()
print(monthly_sales)
This calculates total monthly sales.
Grouping by month helps identify:
-
seasonal trends,
-
peak business months,
-
weak sales periods.
Cell 19: Monthly Sales Line Chart
monthly_sales.plot(marker="o")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly Sales Trend")
plt.show()
Line charts are best for trend analysis.
This graph helps observe:
-
increasing sales,
-
decreasing sales,
-
seasonal patterns.
Cell 20: Deal Size Analysis
deal_sales = sales_df.groupby("DEALSIZE")["SALES"].sum()
deal_sales
This analyzes sales based on deal size.
Typical deal sizes:
-
Small
-
Medium
-
Large
This helps businesses understand:
-
which deal size generates maximum revenue.
Cell 21: Deal Size Pie Chart
deal_sales.plot(
kind="pie",
autopct="%1.1f%%"
)
plt.title("Sales Distribution by Deal Size")
plt.ylabel("")
plt.show()
Pie charts show percentage contribution.
This graph shows:
-
how much revenue comes from each deal size category.
Cell 22: Year-wise Sales Analysis
year_sales = sales_df.groupby("YEAR_ID")["SALES"].sum()
print(year_sales)
This calculates yearly sales.
This helps businesses compare:
-
yearly growth,
-
yearly decline.
Cell 23: Year-wise Sales Bar Chart
year_sales.plot(kind="bar")
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Year-wise Sales")
plt.show()
This chart compares business performance across years.
This is useful for:
-
business reporting,
-
annual growth analysis.
Cell 24: Correlation Analysis
sales_df.corr(numeric_only=True)
Correlation shows relationship between numerical columns.
Values range from:
-1 to +1
If close to:
-
+1 → strong positive relationship
-
-1 → strong negative relationship
This helps identify patterns before Machine Learning.
Cell 25: Final Business Insights
print("Business Insights:")
print("1. Total business sales calculated successfully.")
print("2. Product lines with highest sales identified.")
print("3. Top-performing countries identified.")
print("4. Monthly sales trends analyzed.")
print("5. Deal size contribution analyzed.")
print("6. Year-wise business growth visualized.")
This is the final summary of our EDA.
EDA is all about:
-
understanding the data,
-
identifying patterns,
-
and extracting useful business insights.
Summary
In this project, we performed Exploratory Data Analysis (EDA) on a real-world sales dataset using Python. We learned how to load datasets, analyze sales trends, identify top-performing products and countries, and visualize business data using charts and graphs. This project focuses mainly on practical implementation and helps students understand how real-world business data is analyzed before building Machine Learning.