Project 2_2

Project 2_2: Sales Data Visualization & EDA using Python

Exploratory Data Analysis (EDA) on Real Sales Dataset

Objective

In this project, we will perform Exploratory Data Analysis (EDA) on a real-world sales dataset.

We will learn:

  • how to load real datasets,

  • understand business sales data,

  • clean data,

  • analyze trends,

  • and visualize important business insights using Python.

This is exactly how data analysts and ML engineers start real projects in industries.

Cell 1: Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt

We are importing:

  • pandas → used for handling datasets

  • matplotlib.pyplot → used for graphs and visualization

These are the most commonly used libraries in Data Analysis and Machine Learning.

Cell 2: Load the Dataset

df = pd.read_csv("sales_data_sample.csv", encoding="latin1")

Here we load the CSV dataset using:

pd.read_csv()
  • sales_data_sample.csv → dataset file name

  • encoding="latin1" → helps avoid text reading issues in some datasets

After loading, the dataset is stored in variable df.

Cell 3: Display First 5 Rows

df.head()

head() displays the first 5 records.

This helps us:

  • verify dataset loading,

  • understand column names,

  • and observe sample values.

Cell 4: Check Dataset Shape

df.shape

shape gives:

(rows, columns)

Example:

(2823, 25)

Meaning:

  • 2823 rows

  • 25 columns

This tells us dataset size.

Cell 5: Check Dataset Information

df.info()

info() gives:

  • column names

  • number of non-null values

  • data types

  • memory usage

This helps identify:

  • missing values

  • incorrect data types

Cell 6: Statistical Summary

df.describe()

describe() gives statistical details for numerical columns.

It shows:

  • mean

  • minimum value

  • maximum value

  • standard deviation

  • percentiles

This helps understand sales distribution.

Cell 7: Check Missing Values

df.isnull().sum()

This checks missing values in each column.

Missing values are common in real-world datasets.

Before analysis or ML:

  • we must identify missing data,

  • and later clean or handle it.

Cell 8: View Column Names

df.columns

This displays all dataset columns.

Typical columns in sales datasets:

  • ORDERNUMBER

  • QUANTITYORDERED

  • PRICEEACH

  • SALES

  • PRODUCTLINE

  • CUSTOMERNAME

  • COUNTRY

  • DEALSIZE

Understanding columns is very important before analysis.

Cell 9: Select Important Columns

sales_df = df[[
"ORDERNUMBER",
"SALES",
"PRODUCTLINE",
"COUNTRY",
"YEAR_ID",
"MONTH_ID",
"DEALSIZE"
]]

sales_df.head()

Real datasets contain many columns.

For beginner analysis, we select only useful columns.

This makes the project:

  • easier,

  • cleaner,

  • and more understandable.

Cell 10: Total Sales

total_sales = sales_df["SALES"].sum()

print("Total Sales:", total_sales)

sum() adds all sales values.

This gives total business revenue.

This is one of the most basic business metrics.

Cell 11: Average Sales

average_sales = sales_df["SALES"].mean()

print("Average Sales:", average_sales)

mean() calculates average sales value.

This helps understand:

  • average customer purchase,

  • or average order amount.

Cell 12: Highest Sales Order

highest_sale = sales_df[sales_df["SALES"] == sales_df["SALES"].max()]

highest_sale

Here we identify the highest sales transaction.

max() gives the highest sales value.

Then we filter the dataset to show that record.

Cell 13: Lowest Sales Order

lowest_sale = sales_df[sales_df["SALES"] == sales_df["SALES"].min()]

lowest_sale

This identifies the lowest sales transaction.

Businesses often analyze:

  • highest-performing orders,

  • and weakest-performing orders.

Cell 14: Sales by Product Line

product_sales = sales_df.groupby("PRODUCTLINE")["SALES"].sum()

print(product_sales)

groupby() groups data based on categories.

Here we group by:

  • PRODUCTLINE

Then calculate:

  • total sales for each category

This helps identify best-selling product types.

Cell 15: Product Line Bar Chart

product_sales.plot(kind="bar")

plt.xlabel("Product Line")
plt.ylabel("Total Sales")
plt.title("Sales by Product Line")

plt.show()

This bar chart compares sales across product categories.

Businesses use this to:

  • identify profitable products,

  • improve inventory planning,

  • increase marketing focus.

Cell 16: Country-wise Sales Analysis

country_sales = sales_df.groupby("COUNTRY")["SALES"].sum()

country_sales.sort_values(ascending=False).head(10)

This calculates sales country-wise.

Then:

  • sorts sales in descending order,

  • displays top 10 countries.

This helps businesses identify strong markets.

Cell 17: Top Countries Sales Chart

country_sales.sort_values(ascending=False).head(10).plot(kind="bar")

plt.xlabel("Country")
plt.ylabel("Sales")
plt.title("Top 10 Countries by Sales")

plt.show()

This chart visualizes top-performing countries.

Businesses use this to:

  • focus regional marketing,

  • improve international sales strategies.

Cell 18: Monthly Sales Trend

monthly_sales = sales_df.groupby("MONTH_ID")["SALES"].sum()

print(monthly_sales)

This calculates total monthly sales.

Grouping by month helps identify:

  • seasonal trends,

  • peak business months,

  • weak sales periods.

Cell 19: Monthly Sales Line Chart

monthly_sales.plot(marker="o")

plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly Sales Trend")

plt.show()

Line charts are best for trend analysis.

This graph helps observe:

  • increasing sales,

  • decreasing sales,

  • seasonal patterns.

Cell 20: Deal Size Analysis

deal_sales = sales_df.groupby("DEALSIZE")["SALES"].sum()

deal_sales

This analyzes sales based on deal size.

Typical deal sizes:

  • Small

  • Medium

  • Large

This helps businesses understand:

  • which deal size generates maximum revenue.

Cell 21: Deal Size Pie Chart

deal_sales.plot(
kind="pie",
autopct="%1.1f%%"
)

plt.title("Sales Distribution by Deal Size")

plt.ylabel("")

plt.show()

Pie charts show percentage contribution.

This graph shows:

  • how much revenue comes from each deal size category.

Cell 22: Year-wise Sales Analysis

year_sales = sales_df.groupby("YEAR_ID")["SALES"].sum()

print(year_sales)

This calculates yearly sales.

This helps businesses compare:

  • yearly growth,

  • yearly decline.

Cell 23: Year-wise Sales Bar Chart

year_sales.plot(kind="bar")

plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Year-wise Sales")

plt.show()

This chart compares business performance across years.

This is useful for:

  • business reporting,

  • annual growth analysis.

Cell 24: Correlation Analysis

sales_df.corr(numeric_only=True)

Correlation shows relationship between numerical columns.

Values range from:

-1 to +1

If close to:

  • +1 → strong positive relationship

  • -1 → strong negative relationship

This helps identify patterns before Machine Learning.

Cell 25: Final Business Insights

print("Business Insights:")
print("1. Total business sales calculated successfully.")
print("2. Product lines with highest sales identified.")
print("3. Top-performing countries identified.")
print("4. Monthly sales trends analyzed.")
print("5. Deal size contribution analyzed.")
print("6. Year-wise business growth visualized.")

This is the final summary of our EDA.

EDA is all about:

  • understanding the data,

  • identifying patterns,

  • and extracting useful business insights.

Summary

In this project, we performed Exploratory Data Analysis (EDA) on a real-world sales dataset using Python. We learned how to load datasets, analyze sales trends, identify top-performing products and countries, and visualize business data using charts and graphs. This project focuses mainly on practical implementation and helps students understand how real-world business data is analyzed before building Machine Learning.

Previous Topic Project 2_1 Next Topic Project 3_1