Project 1 - Machine Learning

Project 1: Student Study Analysis

Objective

In this project, we will analyze student study data and understand the relationship between:

Study hours
Attendance
Marks

This is a beginner-level data analysis project. It is a first step before we start training ML models.

Cell 1: Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt

Here we are importing two important Python libraries.

pandas is used for working with data in table format.

matplotlib.pyplot is used for creating graphs and visualizations.

Cell 2: Create Student Dataset

data = {
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90]
}

Here we are creating a small student dataset manually.

This dataset contains four columns:

Student: student name
Study_Hours: how many hours the student studies
Attendance: attendance percentage
Marks: marks scored by the student

We are not using any external file. The data is created directly inside Python.

Cell 3: Convert Data into DataFrame

df = pd.DataFrame(data)

print(df)

Here we convert the dictionary data into a Pandas DataFrame.

A DataFrame is like a table with rows and columns.

In Machine Learning projects with Python, table-style datasets are commonly handled in DataFrame format.
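To make the "table with rows and columns" idea concrete, here is a minimal sketch of picking out a column, a row, and a single cell. It uses a shortened copy of the dataset, and the variable names marks_column, first_row, and one_value are only for illustration.

```python
import pandas as pd

# Shortened copy of the student dataset so the example runs on its own
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul"],
    "Study_Hours": [1, 2, 3],
    "Marks": [35, 40, 50],
})

marks_column = df["Marks"]      # one column, selected by its name
first_row = df.loc[0]           # one row, selected by its index label
one_value = df.loc[0, "Marks"]  # a single cell: row 0, column "Marks"

print(marks_column)
print(first_row)
print(one_value)
```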

Cell 4: Display First Few Records

df.head()

head() displays the first five rows of the dataset by default.

This helps us quickly check whether the data was loaded correctly.
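head() also accepts the number of rows to show, and tail() does the same from the end of the table. A small sketch, using only two columns of the dataset; first_three and last_two are illustrative names.

```python
import pandas as pd

# Two columns of the student dataset so the example runs on its own
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows of the table

print(first_three)
print(last_two)
```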

Cell 5: Check Dataset Shape

df.shape

shape tells us the number of rows and columns in the dataset.

For this dataset, the output is:

(8, 4)

This means the dataset has:

8 rows
4 columns
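Because shape is a pair of numbers, it can be unpacked into two variables. A minimal sketch; n_rows and n_cols are illustrative names.

```python
import pandas as pd

# Same dataset as in Cell 2, repeated so the example runs on its own
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = df.shape

print("Rows:", n_rows)
print("Columns:", n_cols)
```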

Cell 6: Check Dataset Information

df.info()

info() gives basic information about the dataset.

It shows:

column names
number of non-empty values
data type of each column

This is useful for checking data types and for spotting columns that have missing values.
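The same details can also be read one piece at a time, which can be handy when only one of them is needed. A small sketch, assuming a shortened copy of the dataset; column_types and non_empty_counts are illustrative names.

```python
import pandas as pd

# Shortened copy of the student dataset so the example runs on its own
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul"],
    "Study_Hours": [1, 2, 3],
    "Marks": [35, 40, 50],
})

column_types = df.dtypes      # data type of each column
non_empty_counts = df.count() # number of non-empty values in each column

print(column_types)
print(non_empty_counts)
```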

Cell 7: Statistical Summary

df.describe()

describe() gives statistical information about numerical columns.

It shows:

count
mean
standard deviation
minimum value
25%, 50%, and 75% values (the 50% value is the median)
maximum value

This helps us understand the data distribution.
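Each number in the summary can also be calculated on its own. The sketch below does this for the Marks column only, to show where the describe() values come from; the variable names are illustrative.

```python
import pandas as pd

# Marks column from the student dataset
marks = pd.Series([35, 40, 50, 55, 65, 70, 85, 90], name="Marks")

summary = marks.describe()  # count, mean, std, min, 25%, 50%, 75%, max
print(summary)

# The same values, calculated one by one
mean_marks = marks.mean()
median_marks = marks.median()  # same as the 50% value in describe()
std_marks = marks.std()        # sample standard deviation

print("Mean:", mean_marks)
print("Median:", median_marks)
print("Standard deviation:", std_marks)
```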

Cell 8: Check Missing Values

df.isnull().sum()

This cell checks whether there are any missing values in the dataset.

isnull() marks each value as True if it is missing and False if it is present.

sum() then counts how many missing values are present in each column.

If the output is 0 for every column, there are no missing values.
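Our dataset has no missing values, so every count is 0. To see a non-zero count, the sketch below uses a small made-up table (not part of the project data) in which one value is deliberately left empty; df_missing and missing_counts are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up table for illustration only: Sneha's marks are deliberately missing
df_missing = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul"],
    "Study_Hours": [1, 2, 3],
    "Marks": [35, np.nan, 50],
})

# isnull() gives True/False per cell; sum() counts the True values per column
missing_counts = df_missing.isnull().sum()

print(missing_counts)
```

Here the count for Marks is 1, and the other columns stay at 0.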

Cell 9: Sort Students by Marks

df.sort_values(by="Marks", ascending=False)

This sorts the students based on marks.

ascending=False means the highest marks come first.

sort_values() returns a new sorted table. The original df is not changed.

This helps us identify top-performing students.
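A short sketch of storing the sorted table and keeping only the top three students. The names sorted_df and top_three are illustrative, and the last line shows that df itself keeps its original order.

```python
import pandas as pd

# Two columns of the student dataset so the example runs on its own
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

sorted_df = df.sort_values(by="Marks", ascending=False)  # highest marks first
top_three = sorted_df.head(3)                            # first 3 rows of the sorted table

print(top_three)
print(df.head(3))  # original order: df was not modified by sort_values()
```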

Cell 10: Average Study Hours

average_study_hours = df["Study_Hours"].mean()

print("Average Study Hours:", average_study_hours)

Here we calculate the average study hours of all students.

mean() is used to calculate the average value.

This tells us how many hours students study on average.
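The average is simply the total divided by the number of values. A minimal sketch that works this out by hand and compares it with mean(); total_hours, number_of_students, and manual_average are illustrative names.

```python
import pandas as pd

# Study hours column from the student dataset
study_hours = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="Study_Hours")

total_hours = study_hours.sum()        # 1 + 2 + ... + 8 = 36
number_of_students = len(study_hours)  # 8
manual_average = total_hours / number_of_students  # 36 / 8 = 4.5

print("Manual average:", manual_average)
print("mean():", study_hours.mean())
```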

Cell 11: Average Marks

average_marks = df["Marks"].mean()

print("Average Marks:", average_marks)

Here we calculate the average marks of all students.

This helps us understand the overall performance of students.

Cell 12: Student with Highest Marks

top_student = df[df["Marks"] == df["Marks"].max()]

top_student

Here we find the student who scored the highest marks.

max() gives the maximum marks.

Then we filter the dataset to keep only the rows with those marks. If several students share the highest marks, all of them are shown.
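An alternative, shown here only as a sketch, is idxmax(), which returns the row label of the highest value. Unlike the filter above, it points to a single row: if two students had the same highest marks, only the first one would be returned. The names top_index and top_row are illustrative.

```python
import pandas as pd

# Two columns of the student dataset so the example runs on its own
df = pd.DataFrame({
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

top_index = df["Marks"].idxmax()  # row label of the highest marks
top_row = df.loc[top_index]       # the full row for that label

print(top_row["Student"], top_row["Marks"])
```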

Cell 13: Student with Lowest Marks

low_student = df[df["Marks"] == df["Marks"].min()]

low_student

Here we find the student who scored the lowest marks.

min() gives the minimum marks.

This helps us identify students who may need improvement.

Cell 14: Study Hours vs Marks Graph

plt.scatter(df["Study_Hours"], df["Marks"])

plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")

plt.show()

This graph shows the relationship between study hours and marks.

Each point represents one student.

From the graph, we can observe:

As study hours increase, marks also increase.

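This is the basic idea behind Machine Learning prediction: a model learns a pattern like this from known data and uses it to estimate marks for a new student.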

Cell 15: Attendance vs Marks Graph

plt.scatter(df["Attendance"], df["Marks"])

plt.xlabel("Attendance Percentage")
plt.ylabel("Marks")
plt.title("Attendance vs Marks")

plt.show()

This graph shows the relationship between attendance and marks.

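We can observe whether students with higher attendance are scoring better marks.

This helps us understand how attendance is related to performance. The graph shows a relationship only; it does not prove that higher attendance is the cause of better marks.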

Cell 16: Bar Chart of Students and Marks

plt.bar(df["Student"], df["Marks"])

plt.xlabel("Student")
plt.ylabel("Marks")
plt.title("Student Marks Comparison")

plt.show()

This bar chart compares the marks of all students.

It is useful when we want to compare individual performance.

Each bar represents one student’s marks.

Cell 17: Line Chart of Study Hours and Marks

plt.plot(df["Study_Hours"], df["Marks"], marker="o")

plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours and Marks Trend")

plt.show()

This line chart shows the trend between study hours and marks.

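The line helps us clearly see the increasing pattern.

This check is useful before applying Linear Regression, because Linear Regression fits a straight line and works best when the trend is roughly straight.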
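As a preview of what a straight-line fit looks like, here is a rough sketch using NumPy's polyfit. NumPy is not used elsewhere in this project, so treat it as an assumption, and this is not a full Machine Learning workflow. The value 5.5 hours is a made-up input chosen only for illustration.

```python
import numpy as np

# Study hours and marks from the student dataset
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
marks = np.array([35, 40, 50, 55, 65, 70, 85, 90], dtype=float)

# Fit a straight line: marks = slope * hours + intercept
slope, intercept = np.polyfit(hours, marks, deg=1)

# Estimate marks for a hypothetical student who studies 5.5 hours
new_hours = 5.5
predicted_marks = slope * new_hours + intercept

print("Slope:", slope)
print("Intercept:", intercept)
print("Predicted marks for", new_hours, "hours:", predicted_marks)
```

The slope says roughly how many marks are gained per extra study hour according to the fitted line.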

Cell 18: Correlation Between Columns

df[["Study_Hours", "Attendance", "Marks"]].corr()

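Correlation tells us how strongly two columns are related. By default, corr() calculates the Pearson correlation, which measures straight-line relationships.

The value ranges from:

-1 to +1

If the value is close to +1, both values increase together.

For example:

Study hours increase → Marks increase

This is called positive correlation.

If the value is close to -1, one value decreases as the other increases. This is called negative correlation.

If the value is close to 0, there is little or no straight-line relationship between the two columns.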
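To show where the number comes from, the sketch below calculates the Pearson correlation between study hours and marks by hand and compares it with the pandas result. The variable names are illustrative, and NumPy is used only for the square root.

```python
import numpy as np
import pandas as pd

# Study hours and marks from the student dataset
df = pd.DataFrame({
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90],
})

x = df["Study_Hours"]
y = df["Marks"]

# Distance of each value from its column average
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson correlation: shared variation divided by the total variation
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())
r_pandas = x.corr(y)

print("Manual:", r_manual)
print("pandas:", r_pandas)
```

Both values are the same and very close to +1, which matches the rising pattern in the graphs.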

Cell 19: Final Observation

print("Final Observation:")
print("Students who study more hours generally score higher marks.")
print("Students with better attendance also tend to score better marks.")

This is our final conclusion from the analysis.

Before building any Machine Learning model, we first need to understand the data.

This project helps us learn basic data analysis using Python.

Complete Code in One Place

import pandas as pd
import matplotlib.pyplot as plt

data = {
    "Student": ["Amit", "Sneha", "Rahul", "Priya", "Kiran", "Anjali", "Ravi", "Meena"],
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "Attendance": [50, 55, 60, 65, 70, 75, 85, 90],
    "Marks": [35, 40, 50, 55, 65, 70, 85, 90]
}

df = pd.DataFrame(data)

print("Dataset:")
print(df)

print("\nFirst Five Records:")
print(df.head())

print("\nDataset Shape:")
print(df.shape)

print("\nDataset Information:")
df.info()  # info() prints its report itself and returns None, so no print() is needed

print("\nStatistical Summary:")
print(df.describe())

print("\nMissing Values:")
print(df.isnull().sum())

print("\nStudents Sorted by Marks:")
print(df.sort_values(by="Marks", ascending=False))

average_study_hours = df["Study_Hours"].mean()
print("\nAverage Study Hours:", average_study_hours)

average_marks = df["Marks"].mean()
print("Average Marks:", average_marks)

top_student = df[df["Marks"] == df["Marks"].max()]
print("\nTop Student:")
print(top_student)

low_student = df[df["Marks"] == df["Marks"].min()]
print("\nLowest Marks Student:")
print(low_student)

plt.scatter(df["Study_Hours"], df["Marks"])
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours vs Marks")
plt.show()

plt.scatter(df["Attendance"], df["Marks"])
plt.xlabel("Attendance Percentage")
plt.ylabel("Marks")
plt.title("Attendance vs Marks")
plt.show()

plt.bar(df["Student"], df["Marks"])
plt.xlabel("Student")
plt.ylabel("Marks")
plt.title("Student Marks Comparison")
plt.show()

plt.plot(df["Study_Hours"], df["Marks"], marker="o")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Study Hours and Marks Trend")
plt.show()

print("\nCorrelation:")
print(df[["Study_Hours", "Attendance", "Marks"]].corr())

print("\nFinal Observation:")
print("Students who study more hours generally score higher marks.")
print("Students with better attendance also tend to score better marks.")

Summary

In this first project, we learned:

How to create a dataset manually
How to convert data into a DataFrame
How to inspect data
How to check missing values
How to calculate average values
How to find top and low performers
How to visualize data using graphs
How to understand relationships between columns