Multiple Linear Regression: Code
Multiple Linear Regression — House Price Prediction Project
In this tutorial, we will build a Multiple Linear Regression model to predict house prices using multiple input features.
The model will learn the relationship between:
- house area,
- number of bedrooms,
- age of the house,
- and the final house price.
Unlike Simple Linear Regression, which uses only one input feature, Multiple Linear Regression uses several features at once, which can improve prediction accuracy when those features carry additional information.
In this tutorial, we will:
- create a small dataset,
- visualize the data,
- train the regression model,
- understand coefficients and intercept,
- evaluate the model,
- and predict prices for new houses.
This tutorial is beginner-friendly and helps build a strong foundation in Machine Learning regression models.
Step 1: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
In this step, we import all the required libraries for our Multiple Linear Regression project.
- pandas is used to create and manage datasets.
- matplotlib.pyplot is used for plotting graphs and visualizing relationships between variables.
- LinearRegression is the machine learning algorithm used to train the regression model.
- The metrics functions help evaluate the model performance.
Importing libraries is the first step in most machine learning projects.
Step 2: Create the Dataset
data = {
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000]
}
df = pd.DataFrame(data)
df
Here we create a simple house price dataset manually.
The dataset contains:
- Area → size of the house
- Bedrooms → number of bedrooms
- Age → age of the house
- Price → final house price
Each row represents one house.
We then convert the dictionary into a pandas DataFrame so that we can easily work with the data.
Step 3: Check Dataset Information
df.info()
The info() function provides a quick overview of the dataset.
It shows:
- total number of rows
- column names
- data types
- the non-null count of each column, from which missing values can be inferred
This step helps verify whether the dataset is clean before training the model.
Step 4: Display First Rows
df.head()
The head() function displays the first five rows of the dataset.
This helps us:
- understand the structure of the data
- verify column names
- inspect sample values
It is one of the most commonly used functions during data analysis.
Step 5: Check Missing Values
df.isnull().sum()
This step checks whether the dataset contains missing values.
Missing values can negatively affect machine learning models, so it is important to identify them before training.
The output shows the number of missing values in each column.
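Our dataset has no gaps, so every count is zero. To show what the check looks like when something is missing, here is a minimal sketch on a hypothetical three-row table (`df_missing` and its values are made up for illustration and are not part of the tutorial dataset), along with two common ways to handle the gap:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one Age value removed on purpose.
df_missing = pd.DataFrame({
    'Area': [1000, 1200, 1500],
    'Bedrooms': [2, 2, 3],
    'Age': [5, np.nan, 3],
})

# Count missing values per column; only Age reports one.
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# Option 1: drop every row that contains a missing value.
dropped = df_missing.dropna()

# Option 2: fill the gap with a column statistic, here the median Age.
filled = df_missing.fillna({'Age': df_missing['Age'].median()})
print(filled)
```

Which option is appropriate depends on how much data is missing and why; with a dataset as small as ours, dropping rows would be costly.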
Step 6: Visualize Area vs Price
plt.scatter(df['Area'], df['Price'])
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Area vs Price')
plt.show()
This scatter plot visualizes the relationship between house area and house price.
From the graph, we can observe that:
- larger houses generally have higher prices
- there is a positive relationship between area and price
Visualization helps us better understand the data before model training.
Step 7: Visualize Bedrooms vs Price
plt.scatter(df['Bedrooms'], df['Price'])
plt.xlabel('Bedrooms')
plt.ylabel('Price')
plt.title('Bedrooms vs Price')
plt.show()
This graph shows how house price varies with the number of bedrooms.
We can observe that houses with more bedrooms usually have higher prices.
This suggests that the number of bedrooms may be a useful feature for prediction, although in this dataset it also rises together with area.
Step 8: Visualize Age vs Price
plt.scatter(df['Age'], df['Price'])
plt.xlabel('Age')
plt.ylabel('Price')
plt.title('Age vs Price')
plt.show()
This scatter plot shows the relationship between house age and house price.
In many cases:
- newer houses have higher prices
- older houses may have lower prices
This feature can influence the final prediction.
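The three scatter plots can be backed up with numbers. As a minimal sketch, assuming the same dataset as in Step 2, the pandas `corr()` method gives the Pearson correlation between each feature and Price:

```python
import pandas as pd

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})

# Correlation of every feature with Price (Price itself is dropped from the result).
price_corr = df.corr()['Price'].drop('Price')
print(price_corr)
```

For this dataset the correlation is positive for Area and Bedrooms and negative for Age, which matches what the plots show. Correlation only measures linear association; it does not say that one variable causes the other.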
Step 9: Select Features and Target Variable
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']
In machine learning:
- input variables are called features
- the output variable is called the target
Here:
- X contains the input features:
  - Area
  - Bedrooms
  - Age
- y contains the target variable:
  - Price
The model will learn how these features affect house prices.
Step 10: Create the Linear Regression Model
model = LinearRegression()
In this step, we create the Multiple Linear Regression model.
At this point:
- the model is empty
- no learning has happened yet
The model will learn patterns only after training.
Step 11: Train the Model
model.fit(X, y)
The fit() function trains the model using the dataset.
During training, the model learns:
- how each feature affects house price
- the relationship between inputs and outputs
This is the step in which the model's coefficients and intercept are actually estimated from the data.
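To make "training" less of a black box: LinearRegression fits by ordinary least squares, which means it picks the intercept and coefficients that minimize the sum of squared differences between actual and predicted prices. The sketch below, which assumes the same dataset as in Step 2, solves that same least-squares problem directly with NumPy and compares the result with scikit-learn. The names `A` and `w` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']

model = LinearRegression().fit(X, y)

# Least squares by hand: add a column of ones so the first weight acts as
# the intercept, then solve min ||A w - y||^2.
A = np.column_stack([np.ones(len(X)), X.to_numpy()])
w, *_ = np.linalg.lstsq(A, y.to_numpy(), rcond=None)

print("sklearn:", model.intercept_, model.coef_)
print("lstsq:  ", w[0], w[1:])
```

Both approaches should agree up to floating-point rounding, since they solve the same optimization problem.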
Step 12: Print Model Coefficients
print(model.coef_)
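Each coefficient is the model's estimate of how much the predicted price changes when that feature increases by one unit while the other features are held constant.
A positive coefficient means the predicted price rises with the feature; a negative coefficient means it falls.
Because the features are measured in different units (area units, a count of bedrooms, years of age), raw coefficient magnitudes are not directly comparable, so a larger coefficient does not by itself mean a more important feature.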
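One common way to put coefficients on a comparable scale is to standardize the features before fitting, so that each coefficient describes the price change per one standard deviation of its feature. The following is a sketch of that idea using scikit-learn's `StandardScaler`; it is an optional extra and not part of the tutorial's main pipeline, and `scaled_model` and `standardized` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']

# Model on the raw features, as in the tutorial.
raw_model = LinearRegression().fit(X, y)

# Rescale every feature to mean 0 and standard deviation 1, then refit.
X_scaled = StandardScaler().fit_transform(X)
scaled_model = LinearRegression().fit(X_scaled, y)

# Coefficients now mean "price change per one standard deviation".
standardized = pd.Series(scaled_model.coef_, index=X.columns)
print(standardized)
```

Standardizing changes the coefficients but not the fitted prices. Note also that Area, Bedrooms, and Age move closely together in this small dataset, so even standardized coefficients should be read with caution.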
Step 13: Print Intercept
print(model.intercept_)
The intercept is the starting value of the regression equation.
It is the predicted value when all input features are zero.
The intercept is required to form the final regression equation.
Step 14: Display Feature Coefficients Clearly
coefficients = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_
})
coefficients
This step creates a table containing:
- feature names
- their coefficient values
This makes the model easier to interpret and understand.
Step 15: Plot Feature Importance
plt.bar(coefficients['Feature'], coefficients['Coefficient'])
plt.xlabel('Features')
plt.ylabel('Coefficient Value')
plt.title('Feature Importance')
plt.show()
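This bar chart visualizes the coefficient of each feature.
The graph helps us quickly see:
- the sign of each coefficient (whether the feature pushes the price up or down)
- the size of the price change per one-unit increase in each feature
Because the features are measured in different units, bar height shows price change per unit rather than overall importance.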
Step 16: Predict House Prices
predictions = model.predict(X)
predictions
The predict() function generates predicted house prices using the trained model.
The model uses learned patterns to estimate prices from input features.
Step 17: Compare Actual and Predicted Values
comparison = pd.DataFrame({
'Actual Price': y,
'Predicted Price': predictions
})
comparison
This table compares:
- actual house prices
- predicted house prices
Comparing both values helps evaluate how accurately the model performs.
Step 18: Plot Actual vs Predicted Prices
plt.scatter(y, predictions)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted Prices')
plt.show()
This graph compares actual prices with predicted prices.
If predictions are accurate:
- points lie close to the diagonal line where the predicted price equals the actual price
- points do not drift systematically above or below that line
This visualization helps assess model quality.
Step 19: Calculate Residual Errors
residuals = y - predictions
residuals
Residuals represent prediction errors.
Formula:
Residual = Actual Value - Predicted Value
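Residuals closer to zero indicate better predictions; a positive residual means the model predicted too low, and a negative residual means it predicted too high.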
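A useful sanity check on residuals: when a linear regression is fitted by least squares with an intercept, the residuals on the training data sum to zero (up to rounding). The sketch below assumes the same dataset as in Step 2 and checks this, then reports the largest absolute error.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']

model = LinearRegression().fit(X, y)

# Residual = actual value - predicted value, one per house.
residuals = y - model.predict(X)

# Over-predictions and under-predictions cancel out on the training data.
print("Sum of residuals:", residuals.sum())

# The worst single miss, ignoring its sign.
print("Largest absolute residual:", residuals.abs().max())
```

Because positive and negative residuals cancel, the plain sum is not a useful error measure, which is why the evaluation metrics use absolute or squared errors.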
Step 20: Plot Residual Errors
plt.scatter(predictions, residuals)
plt.xlabel('Predicted Prices')
plt.ylabel('Residual Errors')
plt.title('Residual Plot')
plt.axhline(y=0)
plt.show()
The residual plot visualizes prediction errors.
A good regression model should produce:
- small errors
- randomly distributed errors
Patterns in residuals may indicate model problems.
Step 21: Evaluate Model Performance
mae = mean_absolute_error(y, predictions)
mse = mean_squared_error(y, predictions)
r2 = r2_score(y, predictions)
print("MAE:", mae)
print("MSE:", mse)
print("R2 Score:", r2)
These metrics evaluate model performance.
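- MAE is the average absolute prediction error, expressed in the same units as price.
- MSE is the average squared error, so it penalizes larger errors more strongly.
- R2 Score is the proportion of the variance in price that the model explains.
A higher R2 score indicates a closer fit. Here all three metrics are computed on the same data used for training, so they are likely to be optimistic about performance on new houses.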
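To see what these functions compute, the sketch below reproduces the three metrics with plain NumPy, assuming the same dataset as in Step 2. It also derives RMSE, the square root of MSE, which is not used elsewhere in this tutorial but is often reported because it is in the same units as price.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']
predictions = LinearRegression().fit(X, y).predict(X)

errors = y.to_numpy() - predictions

mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse_manual = np.sqrt(mse_manual)      # back in price units

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y.to_numpy() - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y, predictions))
print(mse_manual, mean_squared_error(y, predictions))
print(r2_manual, r2_score(y, predictions))
print("RMSE:", rmse_manual)
```

Each printed pair should match, confirming that the scikit-learn functions implement these formulas.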
Step 22: Predict Price for a New House
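# Use a DataFrame with the training column names; a plain list triggers a feature-name warning
new_house = pd.DataFrame([[2100, 4, 2]], columns=X.columns)
predicted_price = model.predict(new_house)
print(predicted_price[0])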
In this step, we test the model using new house details.
Input values:
- Area = 2100
- Bedrooms = 4
- Age = 2
The model predicts the expected house price based on learned patterns.
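How far off might a prediction for an unseen house be? The metrics in Step 21 were computed on the training data, so they cannot answer that directly. One rough way to estimate it with only eight rows is leave-one-out cross-validation: fit the model on seven houses, predict the eighth, and repeat for every house. This is a sketch of that idea using scikit-learn's `LeaveOneOut` and `cross_val_predict`; it is an optional extension and not part of the original steps.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']

# Error on the data the model was trained on.
train_pred = LinearRegression().fit(X, y).predict(X)
train_mae = mean_absolute_error(y, train_pred)

# Each prediction here comes from a model that never saw that house.
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
loo_mae = mean_absolute_error(y, loo_pred)

print("Training MAE:     ", train_mae)
print("Leave-one-out MAE:", loo_mae)
```

The leave-one-out error is the more honest of the two numbers. With so few rows it is still a noisy estimate, so treat it as an indication rather than a guarantee.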
Step 23: Print Final Equation Values
print("Intercept:", model.intercept_)
for feature, coef in zip(X.columns, model.coef_):
    print(feature, coef)
This step prints the final learned values of:
- intercept
- coefficients
These values form the final regression equation used by the model for predictions.
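Written out, the equation is Price = intercept + (Area coefficient × Area) + (Bedrooms coefficient × Bedrooms) + (Age coefficient × Age). The sketch below, assuming the same dataset as in Step 2 and the new house from Step 22, evaluates this equation by hand and compares it with `model.predict()`. The name `manual_price` is illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same dataset as in Step 2, repeated so the snippet runs on its own.
df = pd.DataFrame({
    'Area': [1000, 1200, 1500, 1800, 2000, 2200, 2500, 2800],
    'Bedrooms': [2, 2, 3, 3, 4, 4, 4, 5],
    'Age': [5, 4, 3, 2, 2, 1, 1, 1],
    'Price': [500000, 580000, 720000, 850000, 950000, 1050000, 1200000, 1350000],
})
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']
model = LinearRegression().fit(X, y)

# The new house from Step 22.
new_house = {'Area': 2100, 'Bedrooms': 4, 'Age': 2}

# Start from the intercept and add one term per feature.
manual_price = model.intercept_
for feature, coef in zip(X.columns, model.coef_):
    manual_price += coef * new_house[feature]

# The library prediction for the same house, for comparison.
library_price = model.predict(pd.DataFrame([new_house])[X.columns])[0]

print("By hand:  ", manual_price)
print("predict():", library_price)
```

The two values should agree up to floating-point rounding, which shows that `predict()` does nothing more than evaluate this equation.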
Summary
In this tutorial, we learned how to build a Multiple Linear Regression model using Python and Scikit-Learn to predict house prices based on multiple features such as area, bedrooms, and house age. We created a dataset, visualized relationships between features and price using plots, trained the regression model, understood coefficients and intercept values, generated predictions, evaluated model performance using different metrics, and finally predicted the price of a new house using custom input values. This project helps beginners understand how machine learning models learn patterns from data and make real-world predictions.