Project 3_1: Employee Salary Analysis using Python
Objective
In this project, we will analyze employee salary data using Python.
We will learn how to:
- create an employee dataset
- analyze salary details
- compare departments
- understand salary distribution
- visualize experience vs salary
- extract HR business insights
This project is useful for understanding basic HR analytics using Python.
Cell 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
Here we are importing two important libraries.
pandas is used to work with data in table format.
matplotlib.pyplot is used to create graphs and visualizations.
Cell 2: Create Employee Dataset
data = {
    "Employee": [
        "Arun", "Sneha", "Rahul", "Priya", "Kiran",
        "Anjali", "Ravi", "Meena", "John", "Asha"
    ],
    "Department": [
        "IT", "HR", "Finance", "IT", "Marketing",
        "HR", "Finance", "IT", "Marketing", "Finance"
    ],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [
        25000, 30000, 40000, 50000, 45000,
        55000, 65000, 70000, 75000, 85000
    ]
}
Here we are creating a small employee dataset manually.
The dataset contains:
- Employee: employee name
- Department: department name
- Experience: work experience in years
- Salary: employee salary
This type of data is commonly used in HR analytics.
Cell 3: Convert Data into DataFrame
df = pd.DataFrame(data)
df
Here we convert the dictionary into a Pandas DataFrame.
A DataFrame is like a table with rows and columns.
In data analysis and machine learning, most datasets are handled in DataFrame format.
Cell 4: Display First Few Records
df.head()
head() displays the first five rows of the dataset by default; pass a number, such as head(3), to show a different number of rows.
This helps us quickly check whether the data was created correctly.
Cell 5: Check Dataset Shape
df.shape
shape tells us how many rows and columns are available in the dataset.
Example output:
(10, 4)
This means the dataset contains:
- 10 rows
- 4 columns
Cell 6: Check Dataset Information
df.info()
info() gives basic information about the dataset.
It shows:
- column names
- number of non-null (non-missing) values in each column
- data type of each column
This helps us check whether the data contains missing values or wrong data types.
Cell 7: Statistical Summary
df.describe()
describe() gives statistical information about numerical columns.
It shows:
- count
- mean
- standard deviation (std)
- minimum value
- 25%, 50%, and 75% values (the 25th, 50th, and 75th percentiles; the 50% value is the median)
- maximum value
This helps us understand salary and experience distribution.
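The individual statistics that describe() reports can also be computed one at a time. The following is a minimal sketch that rebuilds only the Salary column from Cell 2; the variable names are chosen for illustration.

```python
import pandas as pd

# Salary values copied from the dataset in Cell 2.
salary = pd.Series(
    [25000, 30000, 40000, 50000, 45000,
     55000, 65000, 70000, 75000, 85000],
    name="Salary",
)

mean_salary = salary.mean()               # average of all values
median_salary = salary.median()           # the 50% value in describe()
q1, q3 = salary.quantile([0.25, 0.75])    # the 25% and 75% values

print("Mean:", mean_salary)
print("Median:", median_salary)
print("25th percentile:", q1)
print("75th percentile:", q3)
```

Comparing the mean with the median is a quick way to see whether a few very high or very low salaries are pulling the average.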
Cell 8: Check Missing Values
df.isnull().sum()
This checks whether there are any missing values in the dataset.
isnull() returns a table of True/False values, where True marks a missing cell.
sum() adds up the True values in each column, which gives the number of missing values per column.
If the output is 0 for every column, there are no missing values.
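Because the project dataset has no gaps, the output is all zeros. To see what a non-zero result looks like, here is a small sketch using a hypothetical three-row sample (not part of the project data) in which one salary is deliberately left missing.

```python
import pandas as pd

# Hypothetical sample for illustration only: Sneha's salary is missing.
sample = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul"],
    "Salary": [25000, None, 40000],
})

is_missing = sample.isnull()             # True wherever a cell is missing
missing_per_column = is_missing.sum()    # True counts as 1, False as 0

print(is_missing)
print(missing_per_column)
```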
Cell 9: Average Salary
average_salary = df["Salary"].mean()
print("Average Salary:", average_salary)
Here we calculate the average salary of all employees.
mean() is used to calculate the average value.
This helps HR teams understand the overall salary level in the company.
Cell 10: Highest Paid Employee
highest_salary = df[df["Salary"] == df["Salary"].max()]
highest_salary
Here we find the employee who has the highest salary.
max() gives the maximum salary value.
Then we filter the dataset to keep the rows with that salary. If several employees shared the highest salary, all of them would be displayed.
Cell 11: Lowest Paid Employee
lowest_salary = df[df["Salary"] == df["Salary"].min()]
lowest_salary
Here we find the employee who has the lowest salary.
min() gives the minimum salary value.
This identifies the bottom of the salary range in the company.
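The filtering approach in Cells 10 and 11 returns whole rows. When only the employee name is needed, idxmax() and idxmin() offer a shorter route. This is a sketch with only the two columns it needs; note that these methods return the first matching row if there is a tie.

```python
import pandas as pd

# Employee and Salary columns copied from Cell 2.
df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# idxmax()/idxmin() return the row label of the largest/smallest value.
highest_name = df.loc[df["Salary"].idxmax(), "Employee"]
lowest_name = df.loc[df["Salary"].idxmin(), "Employee"]

print("Highest paid:", highest_name)
print("Lowest paid:", lowest_name)
```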
Cell 12: Sort Employees by Salary
df.sort_values(by="Salary", ascending=False)
This sorts employees based on salary.
ascending=False means the highest salary comes first.
sort_values() returns a new sorted table and leaves df itself unchanged.
This helps us quickly see the top earning employees.
Cell 13: Top 3 Highest Paid Employees
top_3 = df.sort_values(by="Salary", ascending=False).head(3)
top_3
This displays the top 3 highest paid employees.
Companies often analyze top earners for compensation planning and salary structure review.
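The sort-then-head pattern above can also be written with nlargest(), which exists for exactly this kind of "top N" question. A minimal sketch, using only the two columns it needs:

```python
import pandas as pd

# Employee and Salary columns copied from Cell 2.
df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

top_3 = df.nlargest(3, "Salary")       # three highest salaries
bottom_3 = df.nsmallest(3, "Salary")   # three lowest salaries

print(top_3)
print(bottom_3)
```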
Cell 14: Department-wise Average Salary
department_salary = df.groupby("Department")["Salary"].mean()
print(department_salary)
Here we group employees based on department.
Then we calculate the average salary for each department.
groupby() is very useful when we want category-wise analysis.
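groupby() is not limited to one statistic at a time. As a sketch of the same category-wise idea, agg() can compute several statistics per department in one step; the name salary_summary is chosen for illustration.

```python
import pandas as pd

# Department and Salary columns copied from Cell 2.
df = pd.DataFrame({
    "Department": ["IT", "HR", "Finance", "IT", "Marketing",
                   "HR", "Finance", "IT", "Marketing", "Finance"],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# One row per department, one column per statistic.
salary_summary = df.groupby("Department")["Salary"].agg(
    ["mean", "min", "max", "count"]
)
print(salary_summary)
```

Seeing the count next to the mean matters here: each department has only two or three employees, so a single salary can move the average a lot.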
Cell 15: Department-wise Salary Bar Chart
department_salary.plot(kind="bar")
plt.xlabel("Department")
plt.ylabel("Average Salary")
plt.title("Department-wise Average Salary")
plt.show()
This bar chart compares average salary across departments.
It helps HR teams see which departments have the highest and lowest average salaries.
This is useful for payroll and budgeting analysis.
Cell 16: Experience vs Salary Scatter Plot
plt.scatter(df["Experience"], df["Salary"])
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Experience vs Salary")
plt.show()
This scatter plot shows the relationship between experience and salary.
Each point represents one employee.
From this graph, we can observe:
More experience → Higher salary
This shows a positive relationship, although it is not perfectly steady: Kiran (5 years) earns less than Priya (4 years).
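One way to put a number on the upward pattern in the scatter plot is to fit a straight line through the points. The sketch below uses numpy.polyfit, which is not part of the original notebook; treat it as an optional illustration rather than a step of the project.

```python
import numpy as np

# Experience and Salary values copied from Cell 2.
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
salary = np.array([25000, 30000, 40000, 50000, 45000,
                   55000, 65000, 70000, 75000, 85000])

# Degree-1 fit: salary is approximated by slope * experience + intercept.
slope, intercept = np.polyfit(experience, salary, 1)

print(f"Each extra year of experience adds about {slope:.0f} to salary.")
print(f"Fitted salary at zero experience: {intercept:.0f}")
```

With only ten made-up records, the fitted line describes this dataset and should not be read as a rule for real companies.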
Cell 17: Employee Salary Comparison
plt.bar(df["Employee"], df["Salary"])
plt.xlabel("Employee")
plt.ylabel("Salary")
plt.title("Employee Salary Comparison")
plt.show()
This bar chart compares salaries of individual employees.
Each bar represents one employee.
This helps us compare employee salaries visually.
Cell 18: Salary Distribution Histogram
plt.hist(df["Salary"], bins=5)
plt.xlabel("Salary")
plt.ylabel("Number of Employees")
plt.title("Salary Distribution")
plt.show()
A histogram shows how salary values are distributed.
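With bins=5, the range between the lowest and highest salary is split into five equal-width intervals, and each bar shows how many employees fall into one interval.
It helps us understand:
- how many employees are in the low salary range
- how many employees are in the medium salary range
- how many employees are in the high salary range
This is useful for payroll analysis.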
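To read the exact numbers behind the histogram bars, the same binning can be done without drawing anything. This sketch uses numpy.histogram with the same bins=5 setting, so the counts should match the bar heights in the chart.

```python
import numpy as np

# Salary values copied from Cell 2.
salary = np.array([25000, 30000, 40000, 50000, 45000,
                   55000, 65000, 70000, 75000, 85000])

# counts[i] is the number of salaries between edges[i] and edges[i + 1].
counts, edges = np.histogram(salary, bins=5)

for i, count in enumerate(counts):
    print(f"{edges[i]:.0f} to {edges[i + 1]:.0f}: {count} employees")
```

For this dataset every interval holds the same number of employees, so the histogram looks flat rather than bell-shaped.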
Cell 19: Box Plot for Salary Analysis
plt.boxplot(df["Salary"])
plt.ylabel("Salary")
plt.title("Salary Box Plot")
plt.show()
A box plot helps us understand salary spread.
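It shows:
- median salary (the line inside the box)
- the middle 50% of salaries (the box, from the 25th to the 75th percentile)
- the whiskers, which by default reach the lowest and highest salaries that lie within 1.5 times the box height from the box edges
- possible outliers (points drawn beyond the whiskers)
When there are no outliers, the whisker ends are the minimum and maximum salary.
This is useful when analyzing salary distribution in companies.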
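The outlier rule behind the box plot can be checked by hand. The sketch below assumes the default whisker length of 1.5 times the interquartile range (IQR); the fence variable names are chosen for illustration.

```python
import pandas as pd

# Salary values copied from Cell 2.
salary = pd.Series([25000, 30000, 40000, 50000, 45000,
                    55000, 65000, 70000, 75000, 85000])

q1 = salary.quantile(0.25)   # bottom edge of the box
q3 = salary.quantile(0.75)   # top edge of the box
iqr = q3 - q1                # height of the box

# Values outside these fences would be drawn as outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = salary[(salary < lower_fence) | (salary > upper_fence)]

print("IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Number of outliers:", len(outliers))
```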
Cell 20: Department-wise Average Experience
department_experience = df.groupby("Department")["Experience"].mean()
print(department_experience)
Here we calculate the average experience of employees in each department.
This helps us understand which department has more experienced employees.
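Average experience is easier to interpret next to average salary. As a sketch, both columns can be averaged in a single groupby() call and sorted so the most experienced department comes first; the name dept_summary is chosen for illustration.

```python
import pandas as pd

# Department, Experience, and Salary columns copied from Cell 2.
df = pd.DataFrame({
    "Department": ["IT", "HR", "Finance", "IT", "Marketing",
                   "HR", "Finance", "IT", "Marketing", "Finance"],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# Mean of both columns per department, most experienced department first.
dept_summary = (
    df.groupby("Department")[["Experience", "Salary"]]
    .mean()
    .sort_values("Experience", ascending=False)
)
print(dept_summary)
```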
Cell 21: Department-wise Experience Chart
department_experience.plot(kind="line", marker="o")
plt.xlabel("Department")
plt.ylabel("Average Experience")
plt.title("Department-wise Average Experience")
plt.show()
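This line chart shows average experience across departments.
Departments are categories with no natural order, so the connecting line is only a visual guide; read the height of each marker, not the slope between markers.
It helps management understand workforce experience distribution.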
Cell 22: Salary Growth Trend
sorted_df = df.sort_values(by="Experience")
plt.plot(
    sorted_df["Experience"],
    sorted_df["Salary"],
    marker="o"
)
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Salary Growth Trend")
plt.show()
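This graph shows how salary changes as experience increases.
It helps us see the salary growth pattern clearly.
In this dataset, salary rises at every step except one: it dips from 50000 at 4 years to 45000 at 5 years.
The overall pattern is:
Experience increases → Salary also increases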
Cell 23: Create Salary Category
def salary_category(salary):
    if salary < 40000:
        return "Low"
    elif salary < 70000:
        return "Medium"
    else:
        return "High"
df["Salary_Category"] = df["Salary"].apply(salary_category)
df
Here we create a new column called Salary_Category.
We divide employees into three groups:
- Low salary: below 40000
- Medium salary: 40000 to 69999
- High salary: 70000 and above
This is an example of feature engineering.
Feature engineering means creating new useful columns from existing data.
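The same three groups can be produced without writing a function, using pandas' cut(). This is an alternative sketch, not the notebook's method; right=False makes each interval include its lower edge so the boundaries match the function in Cell 23.

```python
import pandas as pd

# Salary values copied from Cell 2.
df = pd.DataFrame({
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# Intervals: [0, 40000) -> Low, [40000, 70000) -> Medium, [70000, inf) -> High
df["Salary_Category"] = pd.cut(
    df["Salary"],
    bins=[0, 40000, 70000, float("inf")],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(df)
```

An extra benefit of cut() is that the resulting column remembers the order Low, Medium, High, which a plain text column does not.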
Cell 24: Salary Category Count
salary_category_count = df["Salary_Category"].value_counts()
print(salary_category_count)
This counts how many employees are present in each salary category.
It helps us understand salary group distribution in the company.
Cell 25: Salary Category Distribution Chart
salary_category_count.plot(kind="bar")
plt.xlabel("Salary Category")
plt.ylabel("Employee Count")
plt.title("Salary Category Distribution")
plt.show()
This bar chart shows the number of employees in each salary category.
It helps HR teams analyze payroll structure.
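value_counts() orders its result by count, largest first, so the bars in the chart do not appear in Low, Medium, High order. A small sketch of one way to restore the logical order before plotting, using reindex():

```python
import pandas as pd

# Salary values copied from Cell 2, categorized with the Cell 23 thresholds.
salary = pd.Series([25000, 30000, 40000, 50000, 45000,
                    55000, 65000, 70000, 75000, 85000])
category = salary.apply(
    lambda s: "Low" if s < 40000 else ("Medium" if s < 70000 else "High")
)

counts = category.value_counts()   # ordered by count, largest first

# Reorder the result; fill_value=0 keeps a category even if nobody is in it.
ordered_counts = counts.reindex(["Low", "Medium", "High"], fill_value=0)

print(ordered_counts)
```

Calling ordered_counts.plot(kind="bar") would then draw the bars from Low to High.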
Cell 26: Department-wise Employee Count
department_count = df["Department"].value_counts()
print(department_count)
This counts how many employees are working in each department.
This helps companies understand department-wise workforce distribution.
Cell 27: Department Employee Distribution Pie Chart
plt.pie(
    department_count,
    labels=department_count.index,
    autopct="%1.1f%%"
)
plt.title("Employee Distribution by Department")
plt.show()
This pie chart shows the percentage of employees in each department.
It helps management understand how employees are distributed across departments.
Cell 28: Correlation Analysis
df[["Experience", "Salary"]].corr()
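Correlation measures how strongly two numerical columns move together.
The value ranges from -1 to +1.
A value close to +1 means both values increase together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.
Here we check the relationship between Experience and Salary.
A high positive value tells us that more experienced employees in this dataset tend to earn more; it does not by itself prove that experience causes the higher salary.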
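corr() on two columns returns a 2x2 table in which the diagonal is always 1. When only the single coefficient is needed, one column's corr() method can be called with the other column as its argument, as in this sketch:

```python
import pandas as pd

# Experience and Salary columns copied from Cell 2.
df = pd.DataFrame({
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

# Pearson correlation between the two columns, as a single number.
r = df["Experience"].corr(df["Salary"])
print(f"Correlation between experience and salary: {r:.2f}")
```

For this dataset the value comes out close to +1, which agrees with the upward pattern in the scatter plot.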
Cell 29: Final HR Insights
print("Final HR Insights:")
print("1. Average employee salary calculated.")
print("2. Highest and lowest paid employees identified.")
print("3. Department-wise salary analysis completed.")
print("4. Experience and salary relationship visualized.")
print("5. Salary distribution analyzed using histogram and box plot.")
print("6. Employees categorized into Low, Medium, and High salary groups.")
print("7. Department-wise employee distribution analyzed.")
In real-world companies, HR analytics is used for:
- salary planning
- hiring decisions
- payroll management
- workforce analysis
- department budgeting
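The messages printed in Cell 29 state that each analysis was done but do not show the results. As a sketch of how the summary could report the computed figures instead, the values can be placed into the messages with f-strings; the wording of each message is illustrative.

```python
import pandas as pd

# Same values as the dataset created in Cell 2.
df = pd.DataFrame({
    "Employee": ["Arun", "Sneha", "Rahul", "Priya", "Kiran",
                 "Anjali", "Ravi", "Meena", "John", "Asha"],
    "Department": ["IT", "HR", "Finance", "IT", "Marketing",
                   "HR", "Finance", "IT", "Marketing", "Finance"],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [25000, 30000, 40000, 50000, 45000,
               55000, 65000, 70000, 75000, 85000],
})

avg_salary = df["Salary"].mean()
highest = df.loc[df["Salary"].idxmax()]   # row of the highest paid employee
lowest = df.loc[df["Salary"].idxmin()]    # row of the lowest paid employee
dept_avg = df.groupby("Department")["Salary"].mean()
r = df["Experience"].corr(df["Salary"])

insights = [
    f"Average salary: {avg_salary:,.0f}",
    f"Highest paid: {highest['Employee']} ({int(highest['Salary']):,})",
    f"Lowest paid: {lowest['Employee']} ({int(lowest['Salary']):,})",
    f"Department with highest average salary: {dept_avg.idxmax()}",
    f"Experience vs salary correlation: {r:.2f}",
]

print("Final HR Insights:")
for number, line in enumerate(insights, start=1):
    print(f"{number}. {line}")
```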
Complete Code in One Place
import pandas as pd
import matplotlib.pyplot as plt
data = {
    "Employee": [
        "Arun", "Sneha", "Rahul", "Priya", "Kiran",
        "Anjali", "Ravi", "Meena", "John", "Asha"
    ],
    "Department": [
        "IT", "HR", "Finance", "IT", "Marketing",
        "HR", "Finance", "IT", "Marketing", "Finance"
    ],
    "Experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Salary": [
        25000, 30000, 40000, 50000, 45000,
        55000, 65000, 70000, 75000, 85000
    ]
}
df = pd.DataFrame(data)
print("Employee Dataset:")
print(df)
print("\nFirst Five Records:")
print(df.head())
print("\nDataset Shape:")
print(df.shape)
print("\nDataset Information:")
df.info()  # info() prints its own report and returns None, so it is not wrapped in print()
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
average_salary = df["Salary"].mean()
print("\nAverage Salary:", average_salary)
highest_salary = df[df["Salary"] == df["Salary"].max()]
print("\nHighest Paid Employee:")
print(highest_salary)
lowest_salary = df[df["Salary"] == df["Salary"].min()]
print("\nLowest Paid Employee:")
print(lowest_salary)
print("\nEmployees Sorted by Salary:")
print(df.sort_values(by="Salary", ascending=False))
top_3 = df.sort_values(by="Salary", ascending=False).head(3)
print("\nTop 3 Highest Paid Employees:")
print(top_3)
department_salary = df.groupby("Department")["Salary"].mean()
print("\nDepartment-wise Average Salary:")
print(department_salary)
department_salary.plot(kind="bar")
plt.xlabel("Department")
plt.ylabel("Average Salary")
plt.title("Department-wise Average Salary")
plt.show()
plt.scatter(df["Experience"], df["Salary"])
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Experience vs Salary")
plt.show()
plt.bar(df["Employee"], df["Salary"])
plt.xlabel("Employee")
plt.ylabel("Salary")
plt.title("Employee Salary Comparison")
plt.show()
plt.hist(df["Salary"], bins=5)
plt.xlabel("Salary")
plt.ylabel("Number of Employees")
plt.title("Salary Distribution")
plt.show()
plt.boxplot(df["Salary"])
plt.ylabel("Salary")
plt.title("Salary Box Plot")
plt.show()
department_experience = df.groupby("Department")["Experience"].mean()
print("\nDepartment-wise Average Experience:")
print(department_experience)
department_experience.plot(kind="line", marker="o")
plt.xlabel("Department")
plt.ylabel("Average Experience")
plt.title("Department-wise Average Experience")
plt.show()
sorted_df = df.sort_values(by="Experience")
plt.plot(
    sorted_df["Experience"],
    sorted_df["Salary"],
    marker="o"
)
plt.xlabel("Experience")
plt.ylabel("Salary")
plt.title("Salary Growth Trend")
plt.show()
def salary_category(salary):
    if salary < 40000:
        return "Low"
    elif salary < 70000:
        return "Medium"
    else:
        return "High"
df["Salary_Category"] = df["Salary"].apply(salary_category)
print("\nSalary Category:")
print(df[["Employee", "Salary", "Salary_Category"]])
salary_category_count = df["Salary_Category"].value_counts()
print("\nSalary Category Count:")
print(salary_category_count)
salary_category_count.plot(kind="bar")
plt.xlabel("Salary Category")
plt.ylabel("Employee Count")
plt.title("Salary Category Distribution")
plt.show()
department_count = df["Department"].value_counts()
print("\nDepartment-wise Employee Count:")
print(department_count)
plt.pie(
    department_count,
    labels=department_count.index,
    autopct="%1.1f%%"
)
plt.title("Employee Distribution by Department")
plt.show()
print("\nCorrelation:")
print(df[["Experience", "Salary"]].corr())
print("\nFinal HR Insights:")
print("1. Average employee salary calculated.")
print("2. Highest and lowest paid employees identified.")
print("3. Department-wise salary analysis completed.")
print("4. Experience and salary relationship visualized.")
print("5. Salary distribution analyzed using histogram and box plot.")
print("6. Employees categorized into Low, Medium, and High salary groups.")
print("7. Department-wise employee distribution analyzed.")
Summary
In this project, we performed Employee Salary Analysis using Python. We created an employee dataset, analyzed salary details, compared departments, visualized salary distribution, and studied the relationship between experience and salary. This project introduced basic HR analytics concepts and showed how companies can use data analysis for salary planning, workforce management, and business decision-making.