ML Regression — House Price Prediction with Python

2026/02/24 · 7 min read

Table of Contents

  1. Introduction
  2. Import Libraries
  3. Load the Dataset
  4. Exploratory Data Analysis (EDA)
  5. Feature Engineering & Preprocessing
  6. Train the Model
  7. Evaluate the Model
  8. Visualise Predictions
  9. Summary & Next Steps
  10. Source Code

Introduction

In this post we walk through a complete supervised-learning regression workflow to predict house prices. We will use a synthetic dataset with features such as house size, number of bedrooms, bathrooms, age of the property, and distance to the city centre to predict the house price.

By the end you will understand how to:

  • Perform Exploratory Data Analysis (EDA)
  • Preprocess features with StandardScaler
  • Train a Linear Regression model with scikit-learn
  • Evaluate performance with MAE, RMSE, and R²
  • Interpret model coefficients and residual plots

The full Jupyter Notebook is available in the GitHub repository linked at the bottom of this post.


Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

%matplotlib inline
sns.set_theme(style='whitegrid')

Key packages:

Package Purpose
pandas Data manipulation and analysis
numpy Numerical computing
matplotlib / seaborn Data visualisation
scikit-learn ML model training, preprocessing, and evaluation

Load the Dataset

df = pd.read_csv('dataset.csv')
print(f'Shape: {df.shape}')
df.head()

The dataset contains 50 rows and columns for house_size_sqft, num_bedrooms, num_bathrooms, age_years, distance_to_city_km, and the target price_usd.
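Since the data is synthetic, a generator along these lines can produce a similar file. This is a hypothetical recipe: the weights and ranges below are assumptions for illustration, and the repository's dataset.csv may have been built differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50

# Draw the five features from plausible (assumed) ranges
df = pd.DataFrame({
    'house_size_sqft': rng.uniform(800, 3500, n),
    'num_bedrooms': rng.integers(1, 6, n),
    'num_bathrooms': rng.integers(1, 4, n),
    'age_years': rng.uniform(0, 40, n),
    'distance_to_city_km': rng.uniform(1, 30, n),
})

# Price as a linear combination of the features plus noise
df['price_usd'] = (
    150 * df['house_size_sqft']
    + 10_000 * df['num_bedrooms']
    + 5_000 * df['num_bathrooms']
    - 2_000 * df['age_years']
    - 3_000 * df['distance_to_city_km']
    + rng.normal(scale=10_000, size=n)
)

print(df.shape)  # (50, 6)
```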


Exploratory Data Analysis (EDA)

Basic statistics

df.describe()

The describe() output gives us count, mean, standard deviation, min, quartiles (25%, 50%, 75%), and max for every numeric column. Key things to note:

Field Meaning
mean Average value — the central tendency
std Standard deviation — how spread out values are
IQR (Q3 − Q1) Useful for detecting outliers
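The IQR outlier check mentioned above can be sketched on a toy series (the prices here are made up; in practice you would pass a column such as df['price_usd']):

```python
import pandas as pd

# Toy price series with one obvious outlier (hypothetical values)
prices = pd.Series([250_000, 260_000, 270_000, 280_000, 290_000, 900_000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())  # [900000]
```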

Check for missing values

print(df.isnull().sum())

No missing values were found in this dataset.

Distribution of house prices

plt.figure(figsize=(8, 4))
sns.histplot(df['price_usd'], bins=15, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price (USD)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

Correlation heatmap

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

The Pearson correlation coefficients reveal important relationships:

Pair Correlation Meaning
house_size_sqft ↔ price_usd +1.00 Larger houses → higher prices
num_bedrooms ↔ price_usd +0.97 More bedrooms → higher prices
age_years ↔ price_usd -0.90 Older houses → lower prices
distance_to_city_km ↔ price_usd -0.86 Farther from city → lower prices
age_years ↔ distance_to_city_km +0.99 🚨 Multicollinearity risk — these carry redundant information

⚠️ Multicollinearity occurs when independent features are highly correlated with each other. This can make model coefficients unstable. Consider dropping one of the correlated features or using regularisation (Ridge / Lasso).
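The coefficient instability can be demonstrated with a small synthetic example (not the blog's dataset — just two nearly identical features standing in for age_years and distance_to_city_km):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50

# Two nearly identical features, correlation close to 1
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # only x1 truly matters

X = np.column_stack([x1, x2])
coefs = LinearRegression().fit(X, y).coef_

# Individual coefficients can be large and offsetting,
# but their sum stays close to the true combined effect (3)
print(coefs, coefs.sum())
```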

Scatter plots — each feature vs price

features = ['house_size_sqft', 'num_bedrooms', 'num_bathrooms',
            'age_years', 'distance_to_city_km']

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, feature in enumerate(features):
    axes[i].scatter(df[feature], df['price_usd'], alpha=0.6)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Price (USD)')
    axes[i].set_title(f'{feature} vs Price')

axes[-1].set_visible(False)
plt.tight_layout()
plt.show()

Feature Engineering & Preprocessing

Train / test split

X = df[features]
y = df['price_usd']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training samples : {len(X_train)}')
print(f'Test samples     : {len(X_test)}')

We use an 80/20 split: 80% of the rows for training and the remaining 20% held out for evaluation.

Feature scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

StandardScaler normalises features to mean = 0, std = 1. This ensures no single feature dominates the model simply because of its scale.
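You can verify the mean-0 / std-1 property on a toy matrix with columns on very different scales (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features with wildly different scales
X = np.array([[1000.0, 2], [2000.0, 3], [3000.0, 4], [4000.0, 5]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and (population) std 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```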


Train the Model

model = LinearRegression()
model.fit(X_train_scaled, y_train)

print('Model coefficients:')
for name, coef in zip(features, model.coef_):
    print(f'  {name:30s}: {coef:,.2f}')
print(f'  Intercept                     : {model.intercept_:,.2f}')

Linear Regression fits the equation:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

Where $b_0$ is the intercept and $b_1 \ldots b_n$ are the coefficients.
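The equation above is exactly what `predict` evaluates, which we can confirm on a small synthetic problem (the coefficients 2, -1, 0.5 and intercept 7 are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 7.0  # exact linear relationship

model = LinearRegression().fit(X, y)

# y_hat = b0 + b1*x1 + ... + bn*xn, computed by hand
manual = model.intercept_ + X @ model.coef_
print(np.allclose(manual, model.predict(X)))  # True
```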

Interpreting the coefficients

Since we used StandardScaler, each coefficient reflects the impact of a 1 standard deviation change in that feature:

Feature Coefficient Interpretation
house_size_sqft +114,906 🥇 Strongest predictor — 1 std increase (~633 sqft) raises price by ~$114,906
age_years -27,505 🥈 Older houses lose value — 1 std increase (~10.6 yrs) drops price by ~$27,505
distance_to_city_km +27,344 ⚠️ Positive — surprising! Likely caused by multicollinearity with age_years
num_bedrooms +16,139 More bedrooms → higher price
num_bathrooms +2,377 Weakest predictor
Intercept 264,725 Predicted price when all scaled features are at 0 (close to the dataset mean)

🔍 The unexpected positive sign of distance_to_city_km is a classic symptom of multicollinearity. When two features are nearly identical (0.99 correlation), the model cannot reliably assign individual effects. Solutions include dropping one feature or using Ridge / Lasso regression.
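To see how Ridge helps, here is a sketch on the same kind of synthetic near-duplicate features (not the house data; alpha=1.0 is an arbitrary choice and would normally be tuned):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # near-duplicate feature
y = 3 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Ridge's L2 penalty pulls the two coefficients toward a shared,
# stable value instead of letting OLS split them arbitrarily
print('OLS  :', ols.coef_)
print('Ridge:', ridge.coef_)
```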


Evaluate the Model

y_pred = model.predict(X_test_scaled)

mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f'Mean Absolute Error  (MAE) : ${mae:,.2f}')
print(f'Root Mean Squared Error    : ${rmse:,.2f}')
print(f'R² Score                   : {r2:.4f}')

Results

Metric Value What it means
MAE $5,563 On average, predictions are off by ~$5,563
RMSE $6,286 Slightly higher than MAE — no major outlier errors
R² 0.9982 The model explains 99.82 % of the variance in house prices

How to interpret each metric:

  • MAE — average absolute error. Easy to understand: "our prediction is off by about $5,563 on average."
  • RMSE — penalises large errors more heavily. When RMSE ≈ MAE, errors are evenly distributed.
  • R² — proportion of variance explained. 0.9982 is excellent, though on a small dataset (50 rows) it may indicate overfitting. Cross-validation is recommended.
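All three metrics can be computed by hand and checked against scikit-learn. The toy numbers below are chosen so every prediction is off by exactly 10:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# MAE: mean of |error|
mae = np.mean(np.abs(y_true - y_pred))
# RMSE: sqrt of mean squared error (squaring penalises big errors more)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# R²: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)  # 10.0 10.0 0.992
print(mean_absolute_error(y_true, y_pred),
      np.sqrt(mean_squared_error(y_true, y_pred)),
      r2_score(y_true, y_pred))
```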

Visualise Predictions

Actual vs Predicted

plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', linewidths=0.4)
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect prediction')
plt.xlabel('Actual Price (USD)')
plt.ylabel('Predicted Price (USD)')
plt.title('Actual vs Predicted House Prices')
plt.legend()
plt.tight_layout()
plt.show()

Points close to the red dashed line indicate accurate predictions. Our model's predictions cluster tightly around the line.

Residual plot

residuals = y_test - y_pred

plt.figure(figsize=(7, 4))
plt.scatter(y_pred, residuals, alpha=0.7, edgecolors='k', linewidths=0.4)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted Price (USD)')
plt.ylabel('Residuals (USD)')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()

How to read residuals:

Pattern Meaning
✅ Random scatter around 0 Good — errors are random, model fits well
❌ Funnel shape Heteroscedasticity — errors grow with larger predictions
❌ Curved pattern Non-linearity — consider polynomial features

Our residuals fall between -$8,000 and +$8,000, which is small relative to house prices ranging from $88K to $560K.


Summary & Next Steps

Metric Value
MAE ~$5,563
RMSE ~$6,286
R² 0.9982

Key Takeaways

  • house_size_sqft is the dominant predictor of house price.
  • age_years and distance_to_city_km are negatively correlated with price, which is intuitive — although the fitted coefficient for distance_to_city_km flipped positive because of multicollinearity.
  • High multicollinearity between age_years and distance_to_city_km (0.99) makes individual coefficients unreliable — consider regularisation.
  • Linear Regression achieves a high R² on this dataset because the underlying relationships are roughly linear.

Possible Next Steps

  • Try Ridge / Lasso regression to add regularisation and handle multicollinearity.
  • Experiment with polynomial features to capture non-linear relationships.
  • Use a Random Forest Regressor or Gradient Boosting for potentially higher accuracy.
  • Apply cross-validation to verify the model is not overfitting.
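The cross-validation suggestion can be sketched as follows, on synthetic stand-in data of the same shape (50 rows, 5 features). Note that scaling goes inside a pipeline so each fold is scaled on its own training split, avoiding leakage of test-fold statistics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # stand-in for the 5 house features
y = X @ np.array([5.0, 2.0, 1.0, -3.0, -2.0]) + rng.normal(scale=0.5, size=50)

# Scaler + model in one pipeline, evaluated with 5-fold CV
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f'R² per fold: {np.round(scores, 3)}')
print(f'Mean ± std : {scores.mean():.3f} ± {scores.std():.3f}')
```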

Source Code

The complete Jupyter Notebook with all the code, outputs, and visualisations is available on GitHub:

🔗 regression_analysis.ipynb — keke78ui9/learn-ai

Feel free to clone the repository and experiment with the notebook yourself:

git clone https://github.com/keke78ui9/learn-ai.git
cd learn-ai/ML_regression_01
jupyter notebook regression_analysis.ipynb