# ML Regression — House Price Prediction with Python

## Table of Contents
- Introduction
- Import Libraries
- Load the Dataset
- Exploratory Data Analysis (EDA)
- Feature Engineering & Preprocessing
- Train the Model
- Evaluate the Model
- Visualise Predictions
- Summary & Next Steps
- Source Code
## Introduction
In this post we walk through a complete supervised-learning regression workflow to predict house prices. We will use a synthetic dataset with features such as house size, number of bedrooms, bathrooms, age of the property, and distance to the city centre to predict the house price.
By the end you will understand how to:

- Perform Exploratory Data Analysis (EDA)
- Preprocess features with `StandardScaler`
- Train a Linear Regression model with scikit-learn
- Evaluate performance with MAE, RMSE, and R²
- Interpret model coefficients and residual plots
The full Jupyter Notebook is available in the GitHub repository linked at the bottom of this post.
## Import Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

%matplotlib inline
sns.set_theme(style='whitegrid')
```
Key packages:
| Package | Purpose |
|---|---|
| pandas | Data manipulation and analysis |
| numpy | Numerical computing |
| matplotlib / seaborn | Data visualisation |
| scikit-learn | ML model training, preprocessing, and evaluation |
## Load the Dataset

```python
df = pd.read_csv('dataset.csv')
print(f'Shape: {df.shape}')
df.head()
```
The dataset contains 50 rows, with columns `house_size_sqft`, `num_bedrooms`, `num_bathrooms`, `age_years`, `distance_to_city_km`, and the target `price_usd`.
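The post uses a synthetic dataset, so `dataset.csv` is assumed to be in the working directory. If you want to follow along without the repository, a file with the same column names can be generated like this (the value ranges and coefficients below are illustrative assumptions of mine, not the post's actual generator):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50

# Illustrative feature ranges (assumptions, not from the post)
house_size_sqft = rng.uniform(800, 3500, n)
num_bedrooms = rng.integers(1, 6, n)
num_bathrooms = rng.integers(1, 4, n)
age_years = rng.uniform(0, 40, n)
distance_to_city_km = rng.uniform(1, 30, n)

# Price as a linear combination plus noise (illustrative coefficients)
price_usd = (
    150 * house_size_sqft
    + 10_000 * num_bedrooms
    + 5_000 * num_bathrooms
    - 2_000 * age_years
    - 3_000 * distance_to_city_km
    + rng.normal(0, 10_000, n)
)

df = pd.DataFrame({
    'house_size_sqft': house_size_sqft,
    'num_bedrooms': num_bedrooms,
    'num_bathrooms': num_bathrooms,
    'age_years': age_years,
    'distance_to_city_km': distance_to_city_km,
    'price_usd': price_usd,
})
df.to_csv('dataset.csv', index=False)
```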
## Exploratory Data Analysis (EDA)

### Basic statistics

```python
df.describe()
```
The describe() output gives us count, mean, standard deviation, min, quartiles (25%, 50%, 75%), and max for every numeric column. Key things to note:
| Field | Meaning |
|---|---|
| mean | Average value — the central tendency |
| std | Standard deviation — how spread out values are |
| IQR (Q3 − Q1) | Useful for detecting outliers |
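As a sketch of the IQR rule mentioned in the table, values outside `Q1 - 1.5*IQR` or `Q3 + 1.5*IQR` are commonly flagged as outliers (the 1.5 multiplier is the usual Tukey convention, an assumption not stated in the post):

```python
import pandas as pd

# Toy series with one obvious outlier
prices = pd.Series([100, 110, 105, 120, 115, 108, 500])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())  # the 500 value is flagged
```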
### Check for missing values

```python
print(df.isnull().sum())
```
No missing values were found in this dataset.
### Distribution of house prices

```python
plt.figure(figsize=(8, 4))
sns.histplot(df['price_usd'], bins=15, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price (USD)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
```
### Correlation heatmap

```python
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.show()
```
The Pearson correlation coefficients reveal important relationships:
| Pair | Correlation | Meaning |
|---|---|---|
| `house_size_sqft` ↔ `price_usd` | +1.00 | Larger houses → higher prices |
| `num_bedrooms` ↔ `price_usd` | +0.97 | More bedrooms → higher prices |
| `age_years` ↔ `price_usd` | -0.90 | Older houses → lower prices |
| `distance_to_city_km` ↔ `price_usd` | -0.86 | Farther from city → lower prices |
| `age_years` ↔ `distance_to_city_km` | +0.99 | 🚨 Multicollinearity risk — these carry redundant information |
⚠️ Multicollinearity occurs when independent features are highly correlated with each other. This can make model coefficients unstable. Consider dropping one of the correlated features or using regularisation (Ridge / Lasso).
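Such pairs can also be surfaced programmatically by scanning the upper triangle of the correlation matrix. A small helper sketch (the function name and the 0.95 threshold are my choices, not from the post):

```python
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, r) for column pairs with |r| above the threshold."""
    corr = df.corr().abs()
    pairs = []
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:       # upper triangle only, no self-pairs
            if corr.loc[a, b] > threshold:
                pairs.append((a, b, corr.loc[a, b]))
    return pairs

# Tiny demo: y is an exact multiple of x, z is unrelated
demo = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [5, 1, 4, 2]})
print(highly_correlated_pairs(demo))  # only the (x, y) pair is reported
```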
### Scatter plots — each feature vs price

```python
features = ['house_size_sqft', 'num_bedrooms', 'num_bathrooms',
            'age_years', 'distance_to_city_km']

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, feature in enumerate(features):
    axes[i].scatter(df[feature], df['price_usd'], alpha=0.6)
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Price (USD)')
    axes[i].set_title(f'{feature} vs Price')

axes[-1].set_visible(False)  # hide the unused sixth subplot
plt.tight_layout()
plt.show()
```
## Feature Engineering & Preprocessing

### Train / test split

```python
X = df[features]
y = df['price_usd']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training samples : {len(X_train)}')
print(f'Test samples     : {len(X_test)}')
```
We use an 80/20 split — 80% of the rows for training, 20% for evaluation.
### Feature scaling

```python
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

`StandardScaler` standardises each feature to mean = 0, std = 1, so no single feature dominates the model simply because of its scale. Note that the scaler is fitted on the training set only and then applied to the test set, which prevents information leaking from the test data.
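The mean-0 / std-1 property is easy to verify on a toy array (the numbers below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 4.0]])

scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0))  # each column now has mean ~0
print(scaled.std(axis=0))   # and standard deviation ~1
```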
## Train the Model

```python
model = LinearRegression()
model.fit(X_train_scaled, y_train)

print('Model coefficients:')
for name, coef in zip(features, model.coef_):
    print(f'  {name:30s}: {coef:,.2f}')
print(f'  Intercept : {model.intercept_:,.2f}')
```
Linear Regression fits the equation:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$
Where $b_0$ is the intercept and $b_1 \ldots b_n$ are the coefficients.
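This equation can be checked by hand: a prediction is just the intercept plus the dot product of the coefficients with the feature vector. A minimal sketch on made-up data that follows an exact linear rule:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated by y = 10 + 2*x1 - 3*x2 exactly
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = 10 + 2 * X[:, 0] - 3 * X[:, 1]

model = LinearRegression().fit(X, y)

x_new = np.array([5.0, 2.0])
manual = model.intercept_ + model.coef_ @ x_new        # b0 + b1*x1 + b2*x2
auto = model.predict(x_new.reshape(1, -1))[0]
print(manual, auto)  # both equal 10 + 2*5 - 3*2 = 14
```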
### Interpreting the coefficients
Since we used StandardScaler, each coefficient reflects the impact of a 1 standard deviation change in that feature:
| Feature | Coefficient | Interpretation |
|---|---|---|
| `house_size_sqft` | +114,906 | 🥇 Strongest predictor — 1 std increase (~633 sqft) raises price by ~$114,906 |
| `age_years` | -27,505 | 🥈 Older houses lose value — 1 std increase (~10.6 yrs) drops price by ~$27,505 |
| `distance_to_city_km` | +27,344 | ⚠️ Positive — surprising! Likely caused by multicollinearity with `age_years` |
| `num_bedrooms` | +16,139 | More bedrooms → higher price |
| `num_bathrooms` | +2,377 | Weakest predictor |
| Intercept | 264,725 | Predicted price when all scaled features are at 0 (close to the dataset mean) |
🔍 The unexpected positive sign of `distance_to_city_km` is a classic symptom of multicollinearity. When two features are nearly identical (0.99 correlation), the model cannot reliably assign individual effects. Solutions include dropping one feature or using Ridge / Lasso regression.
## Evaluate the Model

```python
y_pred = model.predict(X_test_scaled)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error (MAE) : ${mae:,.2f}')
print(f'Root Mean Squared Error   : ${rmse:,.2f}')
print(f'R² Score                  : {r2:.4f}')
```

### Results
| Metric | Value | What it means |
|---|---|---|
| MAE | $5,563 | On average, predictions are off by ~$5,563 |
| RMSE | $6,286 | Slightly higher than MAE — no major outlier errors |
| R² | 0.9982 | The model explains 99.82 % of the variance in house prices |
How to interpret each metric:
- MAE — average absolute error. Easy to understand: "our prediction is off by about $5,563 on average."
- RMSE — penalises large errors more heavily. When RMSE ≈ MAE, errors are evenly distributed.
- R² — proportion of variance explained. 0.9982 is excellent, though on a small dataset (50 rows) it may indicate overfitting. Cross-validation is recommended.
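A cross-validation sketch with `cross_val_score` (5 folds is a common default choice, and the synthetic one-feature data below is a stand-in of mine for the post's dataset). Putting the scaler inside a pipeline makes each fold fit the scaler on its own training split, avoiding leakage:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 samples, price driven by one feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(800, 3500, size=(50, 1))
y = 150 * X[:, 0] + rng.normal(0, 10_000, 50)

# Scaling inside the pipeline is re-fitted per fold
pipeline = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2')
print(scores.mean())  # average R² across the 5 held-out folds
```

If the per-fold R² scores stay close to the single-split R², the model is unlikely to be overfitting.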
## Visualise Predictions

### Actual vs Predicted

```python
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7, edgecolors='k', linewidths=0.4)

min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect prediction')

plt.xlabel('Actual Price (USD)')
plt.ylabel('Predicted Price (USD)')
plt.title('Actual vs Predicted House Prices')
plt.legend()
plt.tight_layout()
plt.show()
```
Points close to the red dashed line indicate accurate predictions. Our model's predictions cluster tightly around the line.
### Residual plot

```python
residuals = y_test - y_pred

plt.figure(figsize=(7, 4))
plt.scatter(y_pred, residuals, alpha=0.7, edgecolors='k', linewidths=0.4)
plt.axhline(0, color='r', linestyle='--')
plt.xlabel('Predicted Price (USD)')
plt.ylabel('Residuals (USD)')
plt.title('Residual Plot')
plt.tight_layout()
plt.show()
```
How to read residuals:
| Pattern | Meaning |
|---|---|
| ✅ Random scatter around 0 | Good — errors are random, model fits well |
| ❌ Funnel shape | Heteroscedasticity — errors grow with larger predictions |
| ❌ Curved pattern | Non-linearity — consider polynomial features |
Our residuals fall between -$8,000 and +$8,000, which is small relative to house prices ranging from $88K to $560K.
## Summary & Next Steps
| Metric | Value |
|---|---|
| MAE | ~$5,563 |
| RMSE | ~$6,286 |
| R² | 0.9982 |
### Key Takeaways

- `house_size_sqft` is the dominant predictor of house price.
- `age_years` and `distance_to_city_km` have a negative effect on price, which is intuitive.
- High multicollinearity between `age_years` and `distance_to_city_km` (0.99) makes individual coefficients unreliable — consider regularisation.
- Linear Regression achieves a high R² on this dataset because the underlying relationships are roughly linear.
### Possible Next Steps
- Try Ridge / Lasso regression to add regularisation and handle multicollinearity.
- Experiment with polynomial features to capture non-linear relationships.
- Use a Random Forest Regressor or Gradient Boosting for potentially higher accuracy.
- Apply cross-validation to verify the model is not overfitting.
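As a sketch of the first next step, Ridge regression shrinks coefficients toward zero and stabilises them when features are nearly collinear. The data below mimics the `age_years` / `distance_to_city_km` situation with two almost identical features (the 0.99+ correlation and `alpha=1.0` are illustrative choices of mine):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Two nearly identical features, each truly contributing +3 to the target
rng = np.random.default_rng(1)
a = rng.normal(0, 1, 200)
b = a + rng.normal(0, 0.01, 200)        # almost a copy of a
y = 3 * a + 3 * b + rng.normal(0, 0.5, 200)

X = StandardScaler().fit_transform(np.column_stack([a, b]))

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may split the shared effect erratically between the twins;
# Ridge keeps both coefficients near the true value of ~3
print('OLS   coefficients:', ols.coef_)
print('Ridge coefficients:', ridge.coef_)
```

The same idea applies to Lasso (`sklearn.linear_model.Lasso`), which can additionally push one of the redundant coefficients all the way to zero.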
## Source Code

The complete Jupyter Notebook with all the code, outputs, and visualisations is available on GitHub:

🔗 regression_analysis.ipynb — keke78ui9/learn-ai

Feel free to clone the repository and experiment with the notebook yourself:

```bash
git clone https://github.com/keke78ui9/learn-ai.git
cd learn-ai/ML_regression_01
jupyter notebook regression_analysis.ipynb
```