Project 3_1 - Machine Learning

Project 3_1: Employee Salary Analysis using Python

Objective

In this project, we will analyze employee salary data using Python.

We will learn how to:

create an employee dataset
analyze salary details
compare departments
understand salary distribution
visualize experience vs salary
extract HR business insights

This project is useful for understanding basic HR analytics using Python.

Cell 1: Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt

Here we are importing two important libraries.

pandas is used to work with data in table format.

matplotlib.pyplot is used to create graphs and visualizations.

Cell 2: Create Employee Dataset

data = {
    "Employee": [
        "Arun", "Sneha", "Rahul", "Priya", "Kiran",
        "Anjali", "Ravi", "Meena", "John", "Asha"
    ],
    "Department": [
        "IT", "HR", "Finance", "IT", "Marketing",
        "HR", "Finance", "IT", "Marketing", "Finance"
    ],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [
        25000, 30000, 40000, 50000, 45000,
        55000, 65000, 70000, 75000, 85000
    ]
}

Here we are creating a small employee dataset manually.

The dataset contains:

Employee: employee name
Department: department name
Experience: work experience in years
Salary: employee salary

This type of data is commonly used in HR analytics.

Cell 3: Convert Data into DataFrame

df = pd.DataFrame(data)

df

Here we convert the dictionary into a Pandas DataFrame.

A DataFrame is like a table with rows and columns.

In data analysis and machine learning, most datasets are handled in DataFrame format.

Cell 4: Display First Few Records

df.head()

head() displays the first five rows of the dataset.

This helps us quickly check whether the data was created correctly.

Cell 5: Check Dataset Shape

df.shape

shape tells us how many rows and columns are available in the dataset.

Example output:

(10, 4)

This means the dataset contains:

10 rows
4 columns

Cell 6: Check Dataset Information

df.info()

info() gives basic information about the dataset.

It shows:

column names
number of non-null values
data type of each column

This helps us check whether the data contains missing values or wrong data types.

Cell 7: Statistical Summary

df.describe()

describe() gives statistical information about numerical columns.

It shows:

count
mean
minimum value
maximum value
standard deviation
25%, 50%, and 75% values

This helps us understand salary and experience distribution.
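Each row of the describe() output can also be computed on its own. The following is a minimal sketch for the Salary column only; the variable names salary and summary are chosen here for illustration and are not part of the original cells.

```python
import pandas as pd

# Salary column copied from the dataset created in Cell 2
salary = pd.Series(
    [25000, 30000, 40000, 50000, 45000, 55000, 65000, 70000, 75000, 85000],
    name="Salary",
)

summary = salary.describe()

# Each describe() entry corresponds to a single Series method
print("mean:", salary.mean())            # same value as summary["mean"]
print("std :", salary.std())             # sample standard deviation (ddof=1)
print("25% :", salary.quantile(0.25))    # first quartile
print("50% :", salary.median())          # median, same as the 50% row
print("75% :", salary.quantile(0.75))    # third quartile
```

Note that the 50% row is the median, which can differ from the mean when salaries are unevenly spread.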

Cell 8: Check Missing Values

df.isnull().sum()

This checks whether there are any missing values in the dataset.

isnull() checks missing values.

sum() counts missing values in each column.

If the output is 0 for every column, it means there are no missing values.
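Our dataset has no missing values, so the output above is all zeros. To see what a non-zero result looks like, here is a small sketch on a hypothetical three-row sample in which one salary is deliberately left blank. The names demo, missing_counts, and rows_with_gaps are illustrative only.

```python
import pandas as pd

# Hypothetical three-row sample with one salary deliberately left blank
demo = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul"],
    "Salary": [25000, float("nan"), 40000],
})

missing_counts = demo.isnull().sum()             # missing values per column
rows_with_gaps = demo[demo["Salary"].isnull()]   # rows where Salary is missing

print(missing_counts)
print(rows_with_gaps)
```

The Salary column reports one missing value, and the second filter shows which row it belongs to.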

Cell 9: Average Salary

average_salary = df["Salary"].mean()

print("Average Salary:", average_salary)

Here we calculate the average salary of all employees.

mean() is used to calculate the average value.

This helps HR teams understand the overall salary level in the company.

Cell 10: Highest Paid Employee

highest_salary = df[df["Salary"] == df["Salary"].max()]

highest_salary

Here we find the employee who has the highest salary.

max() gives the maximum salary value.

Then we filter the dataset to display the employee with that salary.

Cell 11: Lowest Paid Employee

lowest_salary = df[df["Salary"] == df["Salary"].min()]

lowest_salary

Here we find the employee who has the lowest salary.

min() gives the minimum salary value.

This identifies the lower end of the company's salary range.
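The filters in Cells 10 and 11 return every employee who shares the highest or lowest salary. When a single row is enough, idxmax() and idxmin() are an alternative. This is a sketch of that alternative, not the method used in the cells above; highest_row and lowest_row are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# idxmax()/idxmin() return the index label of the first matching row
highest_row = df.loc[df["Salary"].idxmax()]
lowest_row = df.loc[df["Salary"].idxmin()]

print("Highest:", highest_row["Employee"], highest_row["Salary"])
print("Lowest:", lowest_row["Employee"], lowest_row["Salary"])
```

If two employees were tied, this version would show only the first one, while the boolean filter would show both.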

Cell 12: Sort Employees by Salary

df.sort_values(by="Salary", ascending=False)

This sorts employees based on salary.

ascending=False means the highest salary comes first.

This helps us quickly see the top earning employees.
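One detail worth knowing: sort_values() returns a new sorted DataFrame and leaves df itself unchanged. The sketch below shows this and adds a simple 1-based rank. The names sorted_df and ranked are illustrative and do not appear in the cells above.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# The sorted result must be stored in a variable to be reused
sorted_df = df.sort_values(by="Salary", ascending=False)

# The original DataFrame still starts with the first row we typed in
print(df["Employee"].iloc[0])

# Replace the old index with a rank that starts at 1
ranked = sorted_df.reset_index(drop=True)
ranked.index = ranked.index + 1
print(ranked)
```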

Cell 13: Top 3 Highest Paid Employees

top_3 = df.sort_values(by="Salary", ascending=False).head(3)

top_3

This displays the top 3 highest paid employees.

Companies often analyze top earners for compensation planning and salary structure review.
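pandas also provides nlargest(), which combines the sort and the head(3) step. The following sketch compares both approaches on the same data; top_3_sorted and top_3_nlargest are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# Approach used in Cell 13: sort everything, then keep three rows
top_3_sorted = df.sort_values(by="Salary", ascending=False).head(3)

# Alternative: ask directly for the three largest salaries
top_3_nlargest = df.nlargest(3, "Salary")

print(top_3_nlargest)
```

Both approaches select the same three employees here. nsmallest() works the same way for the lowest salaries.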

Cell 14: Department-wise Average Salary

department_salary = df.groupby("Department")["Salary"].mean()

print(department_salary)

Here we group employees based on department.

Then we calculate the average salary for each department.

groupby() is very useful when we want category-wise analysis.
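The average alone can hide how many employees are behind it. As a sketch, agg() can compute several statistics per department in one step; the name department_summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "HR", "Finance", "IT", "Marketing",
                   "HR", "Finance", "IT", "Marketing", "Finance"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# One row per department, one column per statistic
department_summary = df.groupby("Department")["Salary"].agg(
    ["count", "mean", "min", "max"]
)

print(department_summary)
```

With only two or three employees per department, these averages are easily shifted by a single salary, so they should be read with care.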

Cell 15: Department-wise Salary Bar Chart

department_salary.plot(kind="bar")

plt.xlabel("Department")
plt.ylabel("Average Salary")
plt.title("Department-wise Average Salary")

plt.show()

This bar chart compares average salary across departments.

It helps HR teams see which departments have higher average salaries.

This is useful for payroll and budgeting analysis.

Cell 16: Experience vs Salary Scatter Plot

plt.scatter(df["Experience"], df["Salary"])

plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Experience vs Salary")

plt.show()

This scatter plot shows the relationship between experience and salary.

Each point represents one employee.

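From this graph, we can observe the general trend:

More experience → Higher salary

The trend is not perfect: Kiran (5 years) earns 45000, less than Priya (4 years, 50000). Overall, the relationship is positive.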

Cell 17: Employee Salary Comparison

plt.bar(df["Employee"], df["Salary"])

plt.xlabel("Employee")
plt.ylabel("Salary")
plt.title("Employee Salary Comparison")

plt.show()

This bar chart compares salaries of individual employees.

Each bar represents one employee.

This helps us compare employee salaries visually.

Cell 18: Salary Distribution Histogram

plt.hist(df["Salary"], bins=5)

plt.xlabel("Salary")
plt.ylabel("Number of Employees")
plt.title("Salary Distribution")

plt.show()

A histogram shows how salary values are distributed.

It helps us understand:

how many employees are in low salary range
how many employees are in medium salary range
how many employees are in high salary range

This is useful for payroll analysis.
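To read the exact numbers behind the histogram, the bin counts can be computed without drawing anything. This sketch uses numpy.histogram with the same bins=5 setting, which should give the same bins as the plot above; counts and edges are illustrative names.

```python
import numpy as np

# Salary values copied from the dataset created in Cell 2
salaries = [25000, 30000, 40000, 50000, 45000,
            55000, 65000, 70000, 75000, 85000]

# Five equal-width bins between the lowest and highest salary
counts, edges = np.histogram(salaries, bins=5)

for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low:.0f} to {high:.0f}: {count} employees")
```

For this small dataset every bin holds the same number of employees, so the histogram looks flat. A larger dataset would usually show a clearer shape.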

Cell 19: Box Plot for Salary Analysis

plt.boxplot(df["Salary"])

plt.ylabel("Salary")
plt.title("Salary Box Plot")

plt.show()

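A box plot summarizes the spread of salaries.

It shows:

median salary (the line inside the box)
25th and 75th percentiles (the edges of the box)
lowest and highest salaries that are not outliers (the whiskers)
possible outliers (points beyond the whiskers)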

This is useful when analyzing salary distribution in companies.
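The values a box plot draws can be calculated directly. The sketch below assumes the common rule that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the box, which is the default whisker setting in matplotlib. The variable names are illustrative.

```python
import pandas as pd

# Salary column copied from the dataset created in Cell 2
salary = pd.Series(
    [25000, 30000, 40000, 50000, 45000, 55000, 65000, 70000, 75000, 85000]
)

q1 = salary.quantile(0.25)    # lower edge of the box
q3 = salary.quantile(0.75)    # upper edge of the box
iqr = q3 - q1                 # interquartile range (height of the box)

# Limits beyond which a salary is treated as an outlier (1.5 * IQR rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salary[(salary < lower_fence) | (salary > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers found:", len(outliers))
```

No salary in this dataset falls outside the fences, so the box plot shows no outlier points.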

Cell 20: Department-wise Average Experience

department_experience = df.groupby("Department")["Experience"].mean()

print(department_experience)

Here we calculate the average experience of employees in each department.

This helps us understand which department has more experienced employees.

Cell 21: Department-wise Experience Chart

department_experience.plot(kind="line", marker="o")

plt.xlabel("Department")
plt.ylabel("Average Experience")
plt.title("Department-wise Average Experience")

plt.show()

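This line chart shows the average experience in each department. Departments are categories with no natural order, so the line only connects the points; a bar chart would show the same values.

It helps management understand workforce experience distribution.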

Cell 22: Salary Growth Trend

sorted_df = df.sort_values(by="Experience")

plt.plot(
    sorted_df["Experience"],
    sorted_df["Salary"],
    marker="o"
)

plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Salary Growth Trend")

plt.show()

This graph shows how salary changes as experience increases.

It makes the overall salary growth pattern easy to see.

In this dataset, as in many companies, the general pattern is:

Experience increases → Salary also increases
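The pattern can be checked step by step with diff(), which gives the change in salary from one experience level to the next. This is a small sketch; salary_change and drops are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

sorted_df = df.sort_values(by="Experience")

# Salary difference compared with the previous experience level
salary_change = sorted_df["Salary"].diff()

# Rows where salary is lower than at the previous experience level
drops = sorted_df[salary_change < 0]

print(drops)
```

The only drop is at 5 years of experience, which matches the dip visible in the line chart.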

Cell 23: Create Salary Category

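def salary_category(salary):
    # Salaries below 40000 are "Low"
    if salary < 40000:
        return "Low"
    # Salaries from 40000 up to (but not including) 70000 are "Medium"
    elif salary < 70000:
        return "Medium"
    # Salaries of 70000 and above are "High"
    else:
        return "High"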


df["Salary_Category"] = df["Salary"].apply(salary_category)

df

Here we create a new column called Salary_Category.

We divide employees into three groups:

Low salary (below 40000)
Medium salary (40000 to below 70000)
High salary (70000 and above)

This is an example of feature engineering.

Feature engineering means creating new useful columns from existing data.
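The same grouping can be written with pd.cut(), which assigns values to bins without a custom function. This is a sketch of an alternative to the apply() approach above; the bin edges are taken from the salary_category function, and by_apply and by_cut are illustrative names.

```python
import pandas as pd

# Salary column copied from the dataset created in Cell 2
salary = pd.Series(
    [25000, 30000, 40000, 50000, 45000, 55000, 65000, 70000, 75000, 85000]
)

def salary_category(value):
    # Same thresholds as in Cell 23
    if value < 40000:
        return "Low"
    elif value < 70000:
        return "Medium"
    else:
        return "High"

by_apply = salary.apply(salary_category)

# right=False makes each bin include its lower edge, e.g. 40000 is "Medium"
by_cut = pd.cut(
    salary,
    bins=[0, 40000, 70000, float("inf")],
    labels=["Low", "Medium", "High"],
    right=False,
)

print((by_cut.astype(str) == by_apply).all())
```

Both approaches give the same category for every employee in this dataset. pd.cut() is convenient when the thresholds are likely to change, since only the list of edges needs editing.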

Cell 24: Salary Category Count

salary_category_count = df["Salary_Category"].value_counts()

print(salary_category_count)

This counts how many employees are present in each salary category.

It helps us understand salary group distribution in the company.

Cell 25: Salary Category Distribution Chart

salary_category_count.plot(kind="bar")

plt.xlabel("Salary Category")
plt.ylabel("Employee Count")
plt.title("Salary Category Distribution")

plt.show()

This bar chart shows the number of employees in each salary category.

It helps HR teams analyze payroll structure.

Cell 26: Department-wise Employee Count

department_count = df["Department"].value_counts()

print(department_count)

This counts how many employees are working in each department.

This helps companies understand department-wise workforce distribution.

Cell 27: Department Employee Distribution Pie Chart

plt.pie(
    department_count,
    labels=department_count.index,
    autopct="%1.1f%%"
)

plt.title("Employee Distribution by Department")

plt.show()

This pie chart shows employee percentage in each department.

It helps management understand how employees are distributed across departments.
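The percentages printed on the pie chart can be calculated directly with value_counts(normalize=True). A minimal sketch follows; department_share is an illustrative name.

```python
import pandas as pd

# Department column copied from the dataset created in Cell 2
department = pd.Series(
    ["IT", "HR", "Finance", "IT", "Marketing",
     "HR", "Finance", "IT", "Marketing", "Finance"],
    name="Department",
)

# normalize=True returns fractions instead of counts; convert to percent
department_share = (department.value_counts(normalize=True) * 100).round(1)

print(department_share)
```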

Cell 28: Correlation Analysis

df[["Experience", "Salary"]].corr()

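Correlation measures the strength of the linear relationship between two numerical columns.

The value ranges from:

-1 to +1

If the value is close to +1, both values tend to increase together. If it is close to -1, one tends to decrease as the other increases. A value near 0 means there is little linear relationship.

Here we check the relationship between:

Experience
Salary

For this dataset the correlation is about 0.99, a strong positive relationship. Correlation alone does not prove that experience causes the higher salary.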
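To see where the correlation value comes from, it can be calculated by hand from the Pearson formula, which is the default method used by corr(). The sketch below compares the manual result with the pandas result; r_manual and r_pandas are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

x = df["Experience"]
y = df["Salary"]

# Deviations of each value from its column mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson correlation: shared variation divided by total variation
r_manual = (x_dev * y_dev).sum() / (
    ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5
)

r_pandas = x.corr(y)

print("Manual:", round(r_manual, 4))
print("pandas:", round(r_pandas, 4))
```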

Cell 29: Final HR Insights

print("Final HR Insights:")
print("1. Average employee salary calculated.")
print("2. Highest and lowest paid employees identified.")
print("3. Department-wise salary analysis completed.")
print("4. Experience and salary relationship visualized.")
print("5. Salary distribution analyzed using histogram and box plot.")
print("6. Employees categorized into Low, Medium, and High salary groups.")
print("7. Department-wise employee distribution analyzed.")

In real-world companies, HR analytics is used for:

salary planning
hiring decisions
payroll management
workforce analysis
department budgeting
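The print statements in Cell 29 list the steps we completed. As a sketch, the same summary can report the calculated values so that the output can be read on its own. The wording of the messages and the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["IT", "HR", "Finance", "IT", "Marketing",
                   "HR", "Finance", "IT", "Marketing", "Finance"],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

average_salary = df["Salary"].mean()
salary_range = (df["Salary"].min(), df["Salary"].max())
department_avg = df.groupby("Department")["Salary"].mean()
best_department = department_avg.idxmax()   # department with highest average
correlation = df["Experience"].corr(df["Salary"])

print("Final HR Insights:")
print(f"1. Average salary: {average_salary:.0f}")
print(f"2. Salary range: {salary_range[0]} to {salary_range[1]}")
print(f"3. Highest average salary: {best_department} "
      f"({department_avg[best_department]:.0f})")
print(f"4. Experience and salary correlation: {correlation:.2f}")
```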

Complete Code in One Place

import pandas as pd
import matplotlib.pyplot as plt

data = {
    "Employee": [
        "Arun", "Sneha", "Rahul", "Priya", "Kiran",
        "Anjali", "Ravi", "Meena", "John", "Asha"
    ],
    "Department": [
        "IT", "HR", "Finance", "IT", "Marketing",
        "HR", "Finance", "IT", "Marketing", "Finance"
    ],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [
        25000, 30000, 40000, 50000, 45000,
        55000, 65000, 70000, 75000, 85000
    ]
}

df = pd.DataFrame(data)

print("Employee Dataset:")
print(df)

print("\nFirst Five Records:")
print(df.head())

print("\nDataset Shape:")
print(df.shape)

print("\nDataset Information:")
df.info()  # info() prints its report directly and returns None, so no print() is needed

print("\nStatistical Summary:")
print(df.describe())

print("\nMissing Values:")
print(df.isnull().sum())

average_salary = df["Salary"].mean()
print("\nAverage Salary:", average_salary)

highest_salary = df[df["Salary"] == df["Salary"].max()]
print("\nHighest Paid Employee:")
print(highest_salary)

lowest_salary = df[df["Salary"] == df["Salary"].min()]
print("\nLowest Paid Employee:")
print(lowest_salary)

print("\nEmployees Sorted by Salary:")
print(df.sort_values(by="Salary", ascending=False))

top_3 = df.sort_values(by="Salary", ascending=False).head(3)
print("\nTop 3 Highest Paid Employees:")
print(top_3)

department_salary = df.groupby("Department")["Salary"].mean()
print("\nDepartment-wise Average Salary:")
print(department_salary)

department_salary.plot(kind="bar")
plt.xlabel("Department")
plt.ylabel("Average Salary")
plt.title("Department-wise Average Salary")
plt.show()

plt.scatter(df["Experience"], df["Salary"])
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Experience vs Salary")
plt.show()

plt.bar(df["Employee"], df["Salary"])
plt.xlabel("Employee")
plt.ylabel("Salary")
plt.title("Employee Salary Comparison")
plt.show()

plt.hist(df["Salary"], bins=5)
plt.xlabel("Salary")
plt.ylabel("Number of Employees")
plt.title("Salary Distribution")
plt.show()

plt.boxplot(df["Salary"])
plt.ylabel("Salary")
plt.title("Salary Box Plot")
plt.show()

department_experience = df.groupby("Department")["Experience"].mean()
print("\nDepartment-wise Average Experience:")
print(department_experience)

department_experience.plot(kind="line", marker="o")
plt.xlabel("Department")
plt.ylabel("Average Experience")
plt.title("Department-wise Average Experience")
plt.show()

sorted_df = df.sort_values(by="Experience")

plt.plot(
    sorted_df["Experience"],
    sorted_df["Salary"],
    marker="o"
)
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Salary Growth Trend")
plt.show()

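def salary_category(salary):
    # Salaries below 40000 are "Low"
    if salary < 40000:
        return "Low"
    # Salaries from 40000 up to (but not including) 70000 are "Medium"
    elif salary < 70000:
        return "Medium"
    # Salaries of 70000 and above are "High"
    else:
        return "High"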


df["Salary_Category"] = df["Salary"].apply(salary_category)

print("\nSalary Category:")
print(df[["Employee", "Salary", "Salary_Category"]])

salary_category_count = df["Salary_Category"].value_counts()
print("\nSalary Category Count:")
print(salary_category_count)

salary_category_count.plot(kind="bar")
plt.xlabel("Salary Category")
plt.ylabel("Employee Count")
plt.title("Salary Category Distribution")
plt.show()

department_count = df["Department"].value_counts()
print("\nDepartment-wise Employee Count:")
print(department_count)

plt.pie(
    department_count,
    labels=department_count.index,
    autopct="%1.1f%%"
)
plt.title("Employee Distribution by Department")
plt.show()

print("\nCorrelation:")
print(df[["Experience", "Salary"]].corr())

print("\nFinal HR Insights:")
print("1. Average employee salary calculated.")
print("2. Highest and lowest paid employees identified.")
print("3. Department-wise salary analysis completed.")
print("4. Experience and salary relationship visualized.")
print("5. Salary distribution analyzed using histogram and box plot.")
print("6. Employees categorized into Low, Medium, and High salary groups.")
print("7. Department-wise employee distribution analyzed.")

Summary

In this project, we performed Employee Salary Analysis using Python. We created an employee dataset, analyzed salary details, compared departments, visualized salary distribution, and studied the relationship between experience and salary. This project introduced basic HR analytics concepts and showed how companies can use data analysis for salary planning, workforce management, and business decision-making.