Project 1: Student Study Analysis
Objective
In this project, we will analyze student study data and understand the relationship between:
- Study hours
- Attendance
- Marks
This is a beginner-level data analysis project, meant as a first step before training any ML model.
Cell 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
Here we are importing two important Python libraries.
pandas is used for working with data in table format.
matplotlib.pyplot is used for creating graphs and visualizations.
Cell 2: Create Student Dataset
data = {
"Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
"Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
"Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
"Marks": [35, 40, 50, 55, 65, 70, 85, 90]
}
Here we are creating a small student dataset manually.
This dataset contains four columns:
- Student: student name
- Study_Hours: how many hours the student studies
- Attendance: attendance percentage
- Marks: marks scored by the student
We are not using any external file. The data is created directly inside Python.
Cell 3: Convert Data into DataFrame
df = pd.DataFrame(data)
print(df)
Here we convert the dictionary data into a Pandas DataFrame.
A DataFrame is like a table with rows and columns.
In Machine Learning work with Python, tabular datasets are commonly handled as DataFrames.
Cell 4: Display First Few Records
df.head()
head() displays the first five rows of the dataset by default. Passing a number, for example head(3), changes how many rows are shown.
This helps us quickly check whether the data is loaded correctly or not.
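As a small illustration of the point above, the sketch below passes a row count to head() and uses tail() for the last rows. The dataset is the one from Cell 2, rebuilt here only so the snippet runs on its own; the variable names first_three and last_two are made up for this example.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # tail() is the counterpart for the last rows

print(first_three)
print(last_two)
```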
Cell 5: Check Dataset Shape
df.shape
shape tells us the number of rows and columns in the dataset.
For example:
(8, 4)
This means the dataset has:
- 8 rows
- 4 columns
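Since shape is an ordinary Python tuple, it can be unpacked into two variables. The sketch below assumes the dataset from Cell 2; the names n_rows and n_cols are chosen only for this example.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# df.shape is a tuple: (number_of_rows, number_of_columns).
n_rows, n_cols = df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")
```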
Cell 6: Check Dataset Information
df.info()
info() gives basic information about the dataset.
It shows:
- column names
- number of non-null (non-missing) values in each column
- data type of each column
This is useful to check missing values and data types.
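If only the data types are needed, they can also be read directly. The sketch below is one possible way to do it, using the dataset from Cell 2: dtypes lists the type of each column, and select_dtypes picks out the numeric columns. The name numeric_columns is made up for this example.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# Data type of every column.
print(df.dtypes)

# Names of the columns that hold numbers (the Student column is text, so it is left out).
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```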
Cell 7: Statistical Summary
df.describe()
describe() gives statistical information about numerical columns.
It shows:
- count
- mean
- minimum value
- maximum value
- standard deviation
- 25%, 50%, and 75% values (the 25th, 50th, and 75th percentiles; the 50% value is the median)
This helps us understand the data distribution.
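The result of describe() is itself a DataFrame, so single statistics can be picked out with loc. A minimal sketch, assuming the dataset from Cell 2 (the variable names are only for this example):

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

summary = df.describe()

# Rows of the summary are the statistics, columns are the numeric columns.
mean_marks = summary.loc["mean", "Marks"]
median_marks = summary.loc["50%", "Marks"]  # the 50% row is the median

print("Mean marks:", mean_marks)
print("Median marks:", median_marks)
```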
Cell 8: Check Missing Values
df.isnull().sum()
This cell checks whether there are any missing values in the dataset.
isnull() marks each cell as True if its value is missing and False otherwise.
sum() then counts the True values, which gives the number of missing values in each column.
If the output is 0 for every column, there are no missing values.
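Our dataset has no missing values, so every count is 0. To see a non-zero count, the sketch below uses a hypothetical, shortened variant of the dataset in which one student's marks are missing. This variant is invented for illustration and is not part of the project data.

```python
import pandas as pd

# Hypothetical variant of the dataset: Rahul's marks were never recorded (None).
df_missing = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya"],
    "Study_Hours": [1, 2, 3, 4],
    "Marks": [35, 40, None, 55],
})

# Count of missing values per column.
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# Rows that contain at least one missing value.
rows_with_missing = df_missing[df_missing.isnull().any(axis=1)]
print(rows_with_missing)
```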
Cell 9: Sort Students by Marks
df.sort_values(by="Marks", ascending=False)
This sorts the students based on marks.
ascending=False means the highest marks come first.
sort_values() returns a new sorted DataFrame and leaves df itself unchanged.
This helps us identify top-performing students.
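The sketch below stores the sorted result in a new variable and takes the top three rows. It assumes the dataset from Cell 2; the names sorted_df and top_three are chosen only for this example.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# Keep the sorted result in its own variable; df keeps its original order.
sorted_df = df.sort_values(by="Marks", ascending=False)

# The first three rows of the sorted table are the three highest scorers.
top_three = sorted_df.head(3)
print(top_three[["Student", "Marks"]])
```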
Cell 10: Average Study Hours
average_study_hours = df["Study_Hours"].mean()
print("Average Study Hours:", average_study_hours)
Here we calculate the average study hours of all students.
mean() is used to calculate the average value.
This tells us how many hours students study on average.
Cell 11: Average Marks
average_marks = df["Marks"].mean()
print("Average Marks:", average_marks)
Here we calculate the average marks of all students.
This helps us understand the overall performance of students.
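One simple use of the average is to list the students who scored above it. This is a small sketch on the dataset from Cell 2, not a required step of the project; the name above_average is made up for this example.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

average_marks = df["Marks"].mean()

# Keep only the rows whose marks are higher than the average.
above_average = df[df["Marks"] > average_marks]
print("Average Marks:", average_marks)
print(above_average[["Student", "Marks"]])
```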
Cell 12: Student with Highest Marks
top_student = df[df["Marks"] == df["Marks"].max()]
top_student
Here we find the student who scored the highest marks.
max() gives the maximum marks.
Then we filter the dataset to keep every row whose marks equal that maximum, so if two students were tied for the top score, both would be shown.
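An alternative, shown here only as a sketch, is idxmax(), which returns the row label of the first maximum. Unlike the filter, it returns a single row even when there is a tie. The dataset is the one from Cell 2, and the variable names are chosen for this example.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# Row label of the (first) highest value in the Marks column.
top_index = df["Marks"].idxmax()

# Look up that row's student name and marks.
top_name = df.loc[top_index, "Student"]
top_marks = df.loc[top_index, "Marks"]
print(top_name, top_marks)
```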
Cell 13: Student with Lowest Marks
low_student = df[df["Marks"] == df["Marks"].min()]
low_student
Here we find the student who scored the lowest marks.
min() gives the minimum marks.
This helps us identify students who may need improvement.
Cell 14: Study Hours vs Marks Graph
plt.scatter(df["Study_Hours"], df["Marks"])
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")
plt.show()
This graph shows the relationship between study hours and marks.
Each point represents one student.
From the graph, we can observe:
As study hours increase, marks also increase.
A Machine Learning model learns this kind of pattern from data and uses it to predict marks for a given number of study hours.
Cell 15: Attendance vs Marks Graph
plt.scatter(df["Attendance"], df["Marks"])
plt.xlabel("Attendance Percentage")
plt.ylabel("Marks")
plt.title("Attendance vs Marks")
plt.show()
This graph shows the relationship between attendance and marks.
We can observe whether students with higher attendance are scoring better marks.
This helps us understand how attendance and performance are related. The graph shows an association only; it does not prove that attendance causes higher marks.
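The same question can be checked with numbers by comparing the average marks of two attendance groups. In the sketch below, the 70% cut-off is an assumption made only for illustration; the dataset is the one from Cell 2.

```python
import pandas as pd

# Same dataset as Cell 2, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# Assumed cut-off for this example: 70% attendance.
high_attendance = df[df["Attendance"] >= 70]
low_attendance = df[df["Attendance"] < 70]

# Average marks in each group.
high_mean = high_attendance["Marks"].mean()
low_mean = low_attendance["Marks"].mean()
print("Average marks, attendance >= 70:", high_mean)
print("Average marks, attendance < 70:", low_mean)
```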
Cell 16: Bar Chart of Students and Marks
plt.bar(df["Student"], df["Marks"])
plt.xlabel("Student")
plt.ylabel("Marks")
plt.title("Student Marks Comparison")
plt.show()
This bar chart compares the marks of all students.
It is useful when we want to compare individual performance.
Each bar represents one student’s marks.
Cell 17: Line Chart of Study Hours and Marks
plt.plot(df["Study_Hours"], df["Marks"], marker="o")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours and Marks Trend")
plt.show()
This line chart shows the trend between study hours and marks.
The line connects the points in row order, which works here because Study_Hours is already sorted, and it helps us clearly see the increasing pattern.
This is useful before applying Linear Regression.
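As a preview of that idea, the sketch below fits a straight line to the study hours and marks using numpy.polyfit. NumPy is not used elsewhere in this project, so treat this as an optional illustration, not the project's method. The 5.5 hours used for the prediction is a made-up input.

```python
import numpy as np
import pandas as pd

# Same dataset as Cell 2 (only the two columns needed here).
df = pd.DataFrame({
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# Fit a degree-1 polynomial (a straight line): Marks = slope * Study_Hours + intercept.
slope, intercept = np.polyfit(df["Study_Hours"].to_numpy(), df["Marks"].to_numpy(), deg=1)
print("Slope:", slope)
print("Intercept:", intercept)

# Use the fitted line to estimate marks for a hypothetical 5.5 study hours.
predicted = slope * 5.5 + intercept
print("Estimated marks for 5.5 hours:", predicted)
```

The slope can be read as the estimated gain in marks for each extra study hour, roughly 8 marks for this dataset.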
Cell 18: Correlation Between Columns
df[["Study_Hours", "Attendance", "Marks"]].corr()
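Correlation tells us how strongly two columns are linearly related. By default, corr() calculates the Pearson correlation.
The value ranges from:
-1 to +1
If the value is close to +1, it means both values increase together.
For example:
Study hours increase → Marks increase
This is called positive correlation.
If the value is close to -1, one value decreases as the other increases. This is called negative correlation.
If the value is close to 0, there is little or no linear relationship between the two columns.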
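To show where the number comes from, the sketch below calculates the Pearson correlation between study hours and marks by hand and compares it with the pandas result. The variable names are chosen only for this example.

```python
import pandas as pd

# Same dataset as Cell 2 (only the two columns needed here).
df = pd.DataFrame({
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

x = df["Study_Hours"]
y = df["Marks"]

# Deviation of each value from its column mean.
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson correlation: sum of products of deviations,
# divided by the square root of the product of the summed squared deviations.
r_manual = (x_dev * y_dev).sum() / ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5

# The same value from pandas.
r_pandas = x.corr(y)

print("Manual:", r_manual)
print("pandas:", r_pandas)
```

Both values are close to +1, which matches the strong upward pattern seen in the graphs.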
Cell 19: Final Observation
print("Final Observation:")
print("Students who study more hours generally score higher marks.")
print("Students with better attendance also tend to score better marks.")
This is our final conclusion from the analysis.
Before building any Machine Learning model, we first need to understand the data.
This project helps us learn basic data analysis using Python.
Complete Code in One Place
import pandas as pd
import matplotlib.pyplot as plt
data = {
"Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
"Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
"Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
"Marks": [35, 40, 50, 55, 65, 70, 85, 90]
}
df = pd.DataFrame(data)
print("Dataset:")
print(df)
print("\nFirst Five Records:")
print(df.head())
print("\nDataset Shape:")
print(df.shape)
print("\nDataset Information:")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()
print("\nStatistical Summary:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())
print("\nStudents Sorted by Marks:")
print(df.sort_values(by="Marks", ascending=False))
average_study_hours = df["Study_Hours"].mean()
print("\nAverage Study Hours:", average_study_hours)
average_marks = df["Marks"].mean()
print("Average Marks:", average_marks)
top_student = df[df["Marks"] == df["Marks"].max()]
print("\nTop Student:")
print(top_student)
low_student = df[df["Marks"] == df["Marks"].min()]
print("\nLowest Marks Student:")
print(low_student)
plt.scatter(df["Study_Hours"], df["Marks"])
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")
plt.show()
plt.scatter(df["Attendance"], df["Marks"])
plt.xlabel("Attendance Percentage")
plt.ylabel("Marks")
plt.title("Attendance vs Marks")
plt.show()
plt.bar(df["Student"], df["Marks"])
plt.xlabel("Student")
plt.ylabel("Marks")
plt.title("Student Marks Comparison")
plt.show()
plt.plot(df["Study_Hours"], df["Marks"], marker="o")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours and Marks Trend")
plt.show()
print("\nCorrelation:")
print(df[["Study_Hours", "Attendance", "Marks"]].corr())
print("\nFinal Observation:")
print("Students who study more hours generally score higher marks.")
print("Students with better attendance also tend to score better marks.")
Summary
In this first project, we learned:
- How to create a dataset manually
- How to convert data into a DataFrame
- How to inspect data
- How to check missing values
- How to calculate average values
- How to find top and low performers
- How to visualize data using graphs
- How to understand relationships between columns