Project 2_2 - Machine Learning

Project 2_2: Sales Data Visualization & EDA using Python

Exploratory Data Analysis (EDA) on Real Sales Dataset

Objective

In this project, we will perform Exploratory Data Analysis (EDA) on a real-world sales dataset.

We will learn:

how to load real datasets,
understand business sales data,
clean data,
analyze trends,
and visualize important business insights using Python.

This is exactly how data analysts and ML engineers start real projects in industries.

Cell 1: Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt

We are importing:

pandas → used for handling datasets
matplotlib.pyplot → used for graphs and visualization

These are the most commonly used libraries in Data Analysis and Machine Learning.

Cell 2: Load the Dataset

df = pd.read_csv("sales_data_sample.csv", encoding="latin1")

Here we load the CSV dataset using:

pd.read_csv()

sales_data_sample.csv → dataset file name
encoding="latin1" → helps avoid text reading issues in some datasets

After loading, the dataset is stored in variable df.

Cell 3: Display First 5 Rows

df.head()

head() displays the first 5 records.

This helps us:

verify dataset loading,
understand column names,
and observe sample values.

Cell 4: Check Dataset Shape

df.shape

shape gives:

(rows, columns)

Example:

(2823, 25)

Meaning:

2823 rows
25 columns

This tells us dataset size.

Cell 5: Check Dataset Information

df.info()

info() gives:

column names
number of non-null values
data types
memory usage

This helps identify:

missing values
incorrect data types

Cell 6: Statistical Summary

df.describe()

describe() gives statistical details for numerical columns.

It shows:

mean
minimum value
maximum value
standard deviation
percentiles

This helps understand sales distribution.

Cell 7: Check Missing Values

df.isnull().sum()

This checks missing values in each column.

Missing values are common in real-world datasets.

Before analysis or ML:

we must identify missing data,
and later clean or handle it.

Cell 8: View Column Names

df.columns

This displays all dataset columns.

Typical columns in sales datasets:

ORDERNUMBER
QUANTITYORDERED
PRICEEACH
SALES
PRODUCTLINE
CUSTOMERNAME
COUNTRY
DEALSIZE

Understanding columns is very important before analysis.

Cell 9: Select Important Columns

sales_df = df[[
    "ORDERNUMBER",
    "SALES",
    "PRODUCTLINE",
    "COUNTRY",
    "YEAR_ID",
    "MONTH_ID",
    "DEALSIZE"
]]

sales_df.head()

Real datasets contain many columns.

For beginner analysis, we select only useful columns.

This makes the project:

easier,
cleaner,
and more understandable.

Cell 10: Total Sales

total_sales = sales_df["SALES"].sum()

print("Total Sales:", total_sales)

sum() adds all sales values.

This gives total business revenue.

This is one of the most basic business metrics.

Cell 11: Average Sales

average_sales = sales_df["SALES"].mean()

print("Average Sales:", average_sales)

mean() calculates average sales value.

This helps understand:

average customer purchase,
or average order amount.

Cell 12: Highest Sales Order

highest_sale = sales_df[sales_df["SALES"] == sales_df["SALES"].max()]

highest_sale

Here we identify the highest sales transaction.

max() gives the highest sales value.

Then we filter the dataset to show that record.

Cell 13: Lowest Sales Order

lowest_sale = sales_df[sales_df["SALES"] == sales_df["SALES"].min()]

lowest_sale

This identifies the lowest sales transaction.

Businesses often analyze:

highest-performing orders,
and weakest-performing orders.

Cell 14: Sales by Product Line

product_sales = sales_df.groupby("PRODUCTLINE")["SALES"].sum()

print(product_sales)

groupby() groups data based on categories.

Here we group by:

PRODUCTLINE

Then calculate:

total sales for each category

This helps identify best-selling product types.

Cell 15: Product Line Bar Chart

product_sales.plot(kind="bar")

plt.xlabel("Product Line")
plt.ylabel("Total Sales")
plt.title("Sales by Product Line")

plt.show()

This bar chart compares sales across product categories.

Businesses use this to:

identify profitable products,
improve inventory planning,
increase marketing focus.

Cell 16: Country-wise Sales Analysis

country_sales = sales_df.groupby("COUNTRY")["SALES"].sum()

country_sales.sort_values(ascending=False).head(10)

This calculates sales country-wise.

Then:

sorts sales in descending order,
displays top 10 countries.

This helps businesses identify strong markets.

Cell 17: Top Countries Sales Chart

country_sales.sort_values(ascending=False).head(10).plot(kind="bar")

plt.xlabel("Country")
plt.ylabel("Sales")
plt.title("Top 10 Countries by Sales")

plt.show()

This chart visualizes top-performing countries.

Businesses use this to:

focus regional marketing,
improve international sales strategies.

Cell 18: Monthly Sales Trend

monthly_sales = sales_df.groupby("MONTH_ID")["SALES"].sum()

print(monthly_sales)

This calculates total monthly sales.

Grouping by month helps identify:

seasonal trends,
peak business months,
weak sales periods.

Cell 19: Monthly Sales Line Chart

monthly_sales.plot(marker="o")

plt.xlabel("Month")
plt.ylabel("Sales")
plt.title("Monthly Sales Trend")

plt.show()

Line charts are best for trend analysis.

This graph helps observe:

increasing sales,
decreasing sales,
seasonal patterns.

Cell 20: Deal Size Analysis

deal_sales = sales_df.groupby("DEALSIZE")["SALES"].sum()

deal_sales

This analyzes sales based on deal size.

Typical deal sizes:

Small
Medium
Large

This helps businesses understand:

which deal size generates maximum revenue.

Cell 21: Deal Size Pie Chart

deal_sales.plot(
    kind="pie",
    autopct="%1.1f%%"
)

plt.title("Sales Distribution by Deal Size")

plt.ylabel("")

plt.show()

Pie charts show percentage contribution.

This graph shows:

how much revenue comes from each deal size category.

Cell 22: Year-wise Sales Analysis

year_sales = sales_df.groupby("YEAR_ID")["SALES"].sum()

print(year_sales)

This calculates yearly sales.

This helps businesses compare:

yearly growth,
yearly decline.

Cell 23: Year-wise Sales Bar Chart

year_sales.plot(kind="bar")

plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Year-wise Sales")

plt.show()

This chart compares business performance across years.

This is useful for:

business reporting,
annual growth analysis.

Cell 24: Correlation Analysis

sales_df.corr(numeric_only=True)

Correlation shows relationship between numerical columns.

Values range from:

-1 to +1

If close to:

+1 → strong positive relationship
-1 → strong negative relationship

This helps identify patterns before Machine Learning.

Cell 25: Final Business Insights

print("Business Insights:")
print("1. Total business sales calculated successfully.")
print("2. Product lines with highest sales identified.")
print("3. Top-performing countries identified.")
print("4. Monthly sales trends analyzed.")
print("5. Deal size contribution analyzed.")
print("6. Year-wise business growth visualized.")

This is the final summary of our EDA.

EDA is all about:

understanding the data,
identifying patterns,
and extracting useful business insights.

Summary

In this project, we performed Exploratory Data Analysis (EDA) on a real-world sales dataset using Python. We learned how to load datasets, analyze sales trends, identify top-performing products and countries, and visualize business data using charts and graphs. This project focuses mainly on practical implementation and helps students understand how real-world business data is analyzed before building Machine Learning.