ML Regression — House Price Prediction with Python

2026/02/24 · 7 min read

Table of Contents

  1. Introduction
  2. Import Libraries
  3. Load the Dataset
  4. Exploratory Data Analysis (EDA)
  5. Feature Engineering & Preprocessing
  6. Train the Model
  7. Evaluate the Model
  8. Visualise Predictions
  9. Summary & Next Steps
  10. Source Code

Introduction

In this post we walk through a complete supervised-learning regression workflow to predict house prices. We will use a synthetic dataset with features such as house size, number of bedrooms, bathrooms, age of the property, and distance to the city centre to predict the house price.

By the end you will understand how to:

  • Perform Exploratory Data Analysis (EDA)
  • Preprocess features with StandardScaler
  • Train a Linear Regression model with scikit-learn
  • Evaluate performance with MAE, RMSE, and R²
  • Interpret model coefficients and residual plots

The full Jupyter Notebook is available in the GitHub repository linked at the bottom of this post.


Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

%matplotlib inline
sns.set_theme(style='whitegrid')

Key packages:

Package Purpose
pandas Data manipulation and analysis
numpy Numerical computing
matplotlib / seaborn Data visualisation
scikit-learn ML model training, preprocessing, and evaluation

Load the Dataset

df = pd.read_csv('dataset.csv')
print(f'Shape: {df.shape}')
df.head()

The dataset contains 50 rows and columns for house_size_sqft, num_bedrooms, num_bathrooms, age_years, distance_to_city_km, and the target price_usd.
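Since the data is synthetic, a generator along these lines can produce a similar file. This is a hypothetical recipe: the weights and ranges below are assumptions for illustration, and the repository's dataset.csv may have been built differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50

# Draw the five features from plausible (assumed) ranges
df = pd.DataFrame({
    'house_size_sqft': rng.uniform(800, 3500, n),
    'num_bedrooms': rng.integers(1, 6, n),
    'num_bathrooms': rng.integers(1, 4, n),
    'age_years': rng.uniform(0, 40, n),
    'distance_to_city_km': rng.uniform(1, 30, n),
})

# Price as a linear combination of the features plus noise
df['price_usd'] = (
    150 * df['house_size_sqft']
    + 10_000 * df['num_bedrooms']
    + 5_000 * df['num_bathrooms']
    - 2_000 * df['age_years']
    - 3_000 * df['distance_to_city_km']
    + rng.normal(scale=10_000, size=n)
)

print(df.shape)  # (50, 6)
```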


Exploratory Data Analysis (EDA)

Basic statistics

df.describe()

The describe() output gives us count, mean, standard deviation, min, quartiles (25%, 50%, 75%), and max for every numeric column. Key things to note:

Field Meaning
mean Average value — the central tendency
std Standard deviation — how spread out values are
IQR (Q3 − Q1) Useful for detecting outliers
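The IQR outlier check mentioned above can be sketched on a toy series (the prices here are made up; in practice you would pass a column such as df['price_usd']):

```python
import pandas as pd

# Toy price series with one obvious outlier (hypothetical values)
prices = pd.Series([250_000, 260_000, 270_000, 280_000, 290_000, 900_000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())  # [900000]
```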

Check for missing values

print(df.isnull().sum())

No missing values were found in this dataset.

Distribution of house prices

plt.figure(figsize=(8, 4))
sns.histplot(df['price_usd'], bins=15, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price (USD)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

Correlation heatmap

plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()

The Pearson correlation coefficients reveal important relationships:

Pair Correlation Meaning
house_size_sqft ↔ price_usd +1.00 Larger houses → higher prices
num_bedrooms ↔ price_usd +0.97 More bedrooms → higher prices
age_years ↔ price_usd -0.90 Older houses → lower prices
distance_to_city_km ↔ price_usd -0.86 Farther from city → lower prices
age_years ↔ distance_to_city_km +0.99 🚨 Multicollinearity risk — these carry redundant information

⚠️ Multicollinearity occurs when independent features are highly correlated with each other. This can make model coefficients unstable. Consider dropping one of the correlated features or using regularisation (Ridge / Lasso).
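The coefficient instability can be demonstrated with a small synthetic example (not the blog's dataset — just two nearly identical features standing in for age_years and distance_to_city_km):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50

# Two nearly identical features, correlation close to 1
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = 3 * x1 + rng.normal(scale=0.1, size=n)  # only x1 truly matters

X = np.column_stack([x1, x2])
coefs = LinearRegression().fit(X, y).coef_

# Individual coefficients can be large and offsetting,
# but their sum stays close to the true combined effect (3)
print(coefs, coefs.sum())
```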

Scatter plots — each feature vs price

features = ['house_size_sqft', 'num_bedrooms', 'num_bathrooms',
            'age_years', 'distance_to_city_km']

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, feature in enumerate(features):
    axes[i].scatter(df[feature], df['price_usd'], alpha=0.6)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Price (USD)')
    axes[i].set_title(f'{feature} vs Price')

axes[-1].set_visible(False)
plt.tight_layout()
plt.show()

Feature Engineering & Preprocessing

Train / test split

X = df[features]
y = df['price_usd']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training samples : {len(X_train)}')
print(f'Test samples     : {len(X_test)}')

We use an 80/20 split: 80% of the rows for training and the remaining 20% held out for evaluation.

Feature scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

StandardScaler normalises features to mean = 0, std = 1. This ensures no single feature dominates the model simply because of its scale.
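You can verify the mean-0 / std-1 property on a toy matrix with columns on very different scales (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features with wildly different scales
X = np.array([[1000.0, 2], [2000.0, 3], [3000.0, 4], [4000.0, 5]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and (population) std 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```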


Train the Model

model = LinearRegression()
model.fit(X_train_scaled, y_train)

print('Model coefficients:')
for name, coef in zip(features, model.coef_):
    print(f'  {name:30s}: {coef:,.2f}')
print(f'  Intercept                     : {model.intercept_:,.2f}')

Linear Regression fits the equation:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

Where $b_0$ is the intercept and $b_1 \ldots b_n$ are the coefficients.
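The equation above is exactly what `predict` evaluates, which we can confirm on a small synthetic problem (the coefficients 2, -1, 0.5 and intercept 7 are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 7.0  # exact linear relationship

model = LinearRegression().fit(X, y)

# y_hat = b0 + b1*x1 + ... + bn*xn, computed by hand
manual = model.intercept_ + X @ model.coef_
print(np.allclose(manual, model.predict(X)))  # True
```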

Interpreting the coefficients

Since we used StandardScaler, each coefficient reflects the impact of a 1 standard deviation change in that feature:

Feature Coefficient Interpretation
house_size_sqft +114,906 🥇 Strongest predictor — 1 std increase (~633 sqft) raises price by ~$114,906
age_years -27,505 🥈 Older houses lose value — 1 std increase (~10.6 yrs) drops price by ~$27,505
distance_to_city_km +27,344 ⚠️ Positive — surprising! Likely caused by multicollinearity with age_years
num_bedrooms +16,139 More bedrooms → higher price
num_bathrooms +2,377 Weakest predictor
Intercept 264,725 Predicted price when all scaled features are at 0 (close to the dataset mean)

🔍 The unexpected positive sign of distance_to_city_km is a classic symptom of multicollinearity. When two features are nearly identical (0.99 correlation), the model cannot reliably assign individual effects. Solutions include dropping one feature or using Ridge / Lasso regression.
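To see how Ridge helps, here is a sketch on the same kind of synthetic near-duplicate features (not the house data; alpha=1.0 is an arbitrary choice and would normally be tuned):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # near-duplicate feature
y = 3 * x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Ridge's L2 penalty pulls the two coefficients toward a shared,
# stable value instead of letting OLS split them arbitrarily
print('OLS  :', ols.coef_)
print('Ridge:', ridge.coef_)
```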


Evaluate the Model

y_pred = model.predict(X_test_scaled)

mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f'Mean Absolute Error  (MAE) : ${mae:,.2f}')
print(f'Root Mean Squared Error    : ${rmse:,.2f}')
print(f'R² Score                   : {r2:.4f}')

Results

Metric Value What it means
MAE $5,563 On average, predictions are off by ~$5,563
RMSE $6,286 Slightly higher than MAE — no major outlier errors
R² 0.9982 The model explains 99.82 % of the variance in house prices

How to interpret each metric:

  • MAE — average absolute error. Easy to understand: "our prediction is off by about $5,563 on average."
  • RMSE — penalises large errors more heavily. When RMSE ≈ MAE, errors are evenly distributed.
  • R² — proportion of variance explained. 0.9982 is excellent, though on a small dataset (50 rows) it may indicate overfitting. Cross-validation is recommended.
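All three metrics can be computed by hand and checked against scikit-learn. The toy numbers below are chosen so every prediction is off by exactly 10:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# MAE: mean of |error|
mae = np.mean(np.abs(y_true - y_pred))
# RMSE: sqrt of mean squared error (squaring penalises big errors more)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# R²: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)  # 10.0 10.0 0.992
print(mean_absolute_error(y_true, y_pred),
      np.sqrt(mean_squared_error(y_true, y_pred)),
      r2_score(y_true, y_pred))
```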

Visualise Predictions

Actual vs Predicted

plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', linewidths=0.4)
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect prediction')
plt.xlabel('Actual Price (USD)')
plt.ylabel('Predicted Price (USD)')
plt.title('Actual vs Predicted House Prices')
plt.legend()
plt.tight_layout()
plt.show()

Points close to the red dashed line indicate accurate predictions. Our model's predictions cluster tightly around the line.

Residual plot

residuals = y_test - y_pred

plt.figure(figsize=(7, 4))
plt.scatter(y_pred, residuals, alpha=0.7, edgecolors='k', linewidths=0.4)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted Price (USD)')
plt.ylabel('Residuals (USD)')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()

How to read residuals:

Pattern Meaning
✅ Random scatter around 0 Good — errors are random, model fits well
❌ Funnel shape Heteroscedasticity — errors grow with larger predictions
❌ Curved pattern Non-linearity — consider polynomial features

Our residuals fall between -$8,000 and +$8,000, which is small relative to house prices ranging from $88K to $560K.


Summary & Next Steps

Metric Value
MAE ~$5,563
RMSE ~$6,286
R² 0.9982

Key Takeaways

  • house_size_sqft is the dominant predictor of house price.
  • age_years and distance_to_city_km are negatively correlated with price, which is intuitive — although the fitted coefficient for distance_to_city_km flipped positive because of multicollinearity.
  • High multicollinearity between age_years and distance_to_city_km (0.99) makes individual coefficients unreliable — consider regularisation.
  • Linear Regression achieves a high R² on this dataset because the underlying relationships are roughly linear.

Possible Next Steps

  • Try Ridge / Lasso regression to add regularisation and handle multicollinearity.
  • Experiment with polynomial features to capture non-linear relationships.
  • Use a Random Forest Regressor or Gradient Boosting for potentially higher accuracy.
  • Apply cross-validation to verify the model is not overfitting.
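The cross-validation suggestion can be sketched as follows, on synthetic stand-in data of the same shape (50 rows, 5 features). Note that scaling goes inside a pipeline so each fold is scaled on its own training split, avoiding leakage of test-fold statistics:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # stand-in for the 5 house features
y = X @ np.array([5.0, 2.0, 1.0, -3.0, -2.0]) + rng.normal(scale=0.5, size=50)

# Scaler + model in one pipeline, evaluated with 5-fold CV
model = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print(f'R² per fold: {np.round(scores, 3)}')
print(f'Mean ± std : {scores.mean():.3f} ± {scores.std():.3f}')
```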

Source Code

The complete Jupyter Notebook with all the code, outputs, and visualisations is available on GitHub:

🔗 regression_analysis.ipynb — keke78ui9/learn-ai

Feel free to clone the repository and experiment with the notebook yourself:

git clone https://github.com/keke78ui9/learn-ai.git
cd learn-ai/ML_regression_01
jupyter notebook regression_analysis.ipynb