How to Fix 502 and 503 Gateway Errors on AWS EC2 with an Application Load Balancer

2026/02/14 · 9 min read

Table of Contents

  1. Introduction
  2. The Setup
  3. Why 503 Happens — Bad Health Check Path
  4. Why 502 Happens — App Crash and OOM on t2.micro
  5. Fix 1 — Dedicated Health Check Endpoint
  6. Fix 2 — Replace Single EC2 with an Auto Scaling Group
  7. Fix 3 — Harden the Bootstrap Script
  8. Fix 4 — Reduce Per-Request Memory with Static Generation
  9. Lessons Learned
  10. Conclusion

Introduction

Running a Node.js web app on a single AWS EC2 instance behind an Application Load Balancer (ALB) is a common and affordable setup. However, it comes with some sharp edges that can cause 502 Bad Gateway and 503 Service Unavailable errors — often at the worst possible time: when real traffic is arriving.

This post walks through the exact root causes of both error types in this kind of setup and the changes made to eliminate them. No magic — just a clear explanation of what went wrong and what fixed it.


The Setup

The application is a Next.js blog running on a t2.micro EC2 instance (1 vCPU, 1 GB RAM) in AWS us-west-1. Traffic flows through:

User → Route 53 → ALB (HTTPS/HTTP listeners) → Target Group → EC2 (port 80)

The EC2 instance was provisioned via Terraform and bootstrapped using a user_data shell script that:

  • Installed Node.js
  • Cloned the repo from GitHub
  • Ran npm install and npm run build
  • Started the app with PM2

Sounds simple. But several subtle mistakes caused recurring outages.


Why 503 Happens — Bad Health Check Path

Root cause

The ALB Target Group health check was configured to hit / — the homepage. The homepage used getServerSideProps in Next.js, which on every request:

  1. Read all markdown files from disk
  2. Called Firestore (cloud database) to look up a session
  3. Returned the rendered HTML

If the app was starting up or under memory pressure, this SSR request would take too long or crash entirely. The ALB would mark the instance as unhealthy and return 503 Service Unavailable to all incoming visitors — even if a static asset like /health.html would have responded fine.

Why this is a problem

The ALB health check fires every 30 seconds. Each failed check counts toward the unhealthy threshold (typically 3 consecutive failures). With a heavy SSR endpoint as the health target, a brief spike of memory usage is enough to trigger a full outage.

ALB → GET / → SSR + Firestore call → timeout → instance marked unhealthy → 503
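
You can see the gap directly from a shell on the instance (a quick sketch, assuming the app listens on port 80 and a static /health.html exists, as in Fix 1 below):

# Compare the SSR homepage against a static asset; the ALB gives up after its
# configured timeout (10 seconds in the Terraform below)
curl -s -o /dev/null -w "GET /            -> %{http_code} in %{time_total}s\n" http://localhost/
curl -s -o /dev/null -w "GET /health.html -> %{http_code} in %{time_total}s\n" http://localhost/health.html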

Why 502 Happens — App Crash and OOM on t2.micro

Root cause

A 502 Bad Gateway from an ALB means the ALB reached the instance, but the instance returned no valid HTTP response — usually because the application process has crashed or is not listening on the expected port.

On a t2.micro with 1 GB of RAM, a Next.js app with SSR can consume 300–500 MB just at idle. Under daytime traffic spikes:

  • Multiple concurrent SSR requests each consume memory
  • The Node.js default heap limit (around 700 MB in this environment) is more memory than the instance can actually spare once the OS and PM2 are accounted for
  • The process either hits that limit and throws an OOM (Out of Memory) error, or the kernel kills it first; either way, it crashes
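
You can check the actual heap ceiling on the box itself (a one-liner sketch, assuming Node.js is installed):

# Print V8's default heap size limit in MB; on a 1 GB t2.micro this is more
# memory than the instance can actually spare
node -e 'const v8 = require("v8"); console.log(Math.round(v8.getHeapStatistics().heap_size_limit / 1024 / 1024) + " MB")'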

PM2 is supposed to restart the process after a crash, but bugs in the init script meant PM2 was never configured correctly, which stretched each crash into a longer downtime window.

Additionally, the init script had a critical bug:

# WRONG — sudo su spawns a new shell; the rest of the script never runs inside it
sudo su
cd /home/ec2-user
# ... none of the following commands run as intended

This caused the app to sometimes never start at all, leading to 502 from the very first health check.


Fix 1 — Dedicated Health Check Endpoint

Create a minimal static HTML file at public/health.html:

<!DOCTYPE html>
<html>
  <body>
    ok
  </body>
</html>

Then update the Terraform Target Group health check to point to it:

health_check {
  path                = "/health.html"
  port                = "80"
  healthy_threshold   = 3
  unhealthy_threshold = 3
  interval            = 30
  timeout             = 10
  matcher             = "200"
}

Why this works: /health.html is served directly from disk by Next.js's static file handler. It never touches Node.js SSR logic, Firestore, or any database. It is always fast and always available as long as the process is running.
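
Two quick checks confirm the new target behaves (a sketch; replace the placeholder with your target group's ARN):

# On the instance: should return 200 immediately
curl -i http://localhost/health.html

# From your workstation, after applying the Terraform change: the registered
# target should report "healthy"
aws elbv2 describe-target-health --target-group-arn YOUR_TARGET_GROUP_ARN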


Fix 2 — Replace Single EC2 with an Auto Scaling Group

A single EC2 instance with no self-healing means every crash requires manual recovery. The fix is to run the instance in an Auto Scaling Group (ASG) with desired_capacity = 1:

resource "aws_autoscaling_group" "myAsg" {
  desired_capacity          = 1
  min_size                  = 1
  max_size                  = 2
  health_check_type         = "ELB"
  health_check_grace_period = 720

  launch_template {
    id      = aws_launch_template.myLaunchTemplate.id
    version = "$Latest"
  }
}

Key settings:

  • health_check_type = "ELB" — ASG uses ALB health checks (not just EC2 status checks). If the app stops responding, the ASG terminates and replaces the instance automatically.
  • health_check_grace_period = 720 — give the bootstrap script 12 minutes to complete npm install + next build before the first health check fires. On a t2.micro this takes 8–10 minutes.
  • version = "$Latest" — always boot from the newest launch template version, so a deployment is just publishing a new template version and letting the ASG replace the instance.
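
An instance refresh automates that replacement (a sketch using the AWS CLI; the group name is an assumption, since the Terraform above doesn't set an explicit name):

# Replace in-service instances with ones booted from the latest template
# version; replacements must pass ELB health checks
aws autoscaling start-instance-refresh --auto-scaling-group-name myAsg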

Also ensure the launch template network interface has a public IP so the instance can reach GitHub and npm registries during bootstrap:

network_interfaces {
  associate_public_ip_address = true
  security_groups             = [aws_security_group.mySecurityGroup.id]
}

Fix 3 — Harden the Bootstrap Script

The original user_data script had several problems. Here is what was wrong and what replaced it:

Problem: sudo su breaks subsequent commands

# WRONG — sudo su spawns a new shell; the following lines never run inside it
sudo su
cd /app
npm install   # does not run inside the su shell

EC2 user_data scripts already run as root. Remove sudo su entirely.
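
The corrected version is just the same commands without the wrapper, as in the final script further down:

# CORRECT — user_data already runs as root
cd /app
npm install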

Problem: PM2 flags passed to the wrong process

# WRONG — --max-restarts and --restart-delay go to Next.js, not PM2
pm2 start npm --name "myApp" -- run "start" --max-restarts 10 --restart-delay 5000

# CORRECT — PM2 flags must come before --
pm2 start npm --name "myApp" --max-restarts 10 --restart-delay 5000 -- run "start"

Problem: No PM2 startup persistence

If the instance reboots, PM2 processes are lost. Fix:

pm2 startup systemd -u root --hp /root
pm2 save
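
To confirm persistence is actually wired up (a quick check; pm2-root is the systemd unit name that pm2 startup generates for the root user):

systemctl is-enabled pm2-root   # "enabled" means PM2 resurrects saved apps on boot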

Problem: Node.js heap unbounded

export NODE_OPTIONS="--max-old-space-size=512"

Cap the heap at 512 MB so Node.js throws a controlled OOM error that PM2 can recover from, rather than the OS killing the process unpredictably. Setting this via the NODE_OPTIONS environment variable matters: pm2 launches npm, which spawns the actual Next.js server as a child process, and the environment variable is inherited all the way down, where a CLI flag would not be.

Final bootstrap script (simplified)

#!/bin/bash
set -euo pipefail

# Install Node 18.x
curl -fsSL https://rpm.nodesource.com/setup_18.x | bash -
yum install -y nodejs git

# Install PM2
npm install -g pm2

# Clone and build
cd /home/ec2-user
git clone https://YOUR_TOKEN@github.com/YOUR_ORG/YOUR_REPO.git app
cd app
npm install
npm run build

# Start with memory cap
export NODE_OPTIONS="--max-old-space-size=512"
pm2 start npm --name "myApp" --max-restarts 10 --restart-delay 5000 -- run "start"

# Persist across reboots
pm2 startup systemd -u root --hp /root
pm2 save
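
Once the instance is up, a two-line smoke test catches most regressions (assuming the start script binds port 80, which the target group expects):

pm2 ls                                              # myApp should be "online" with 0 restarts
curl -is http://localhost/health.html | head -n 1   # expect HTTP/1.1 200 OK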

Fix 4 — Reduce Per-Request Memory with Static Generation

The most impactful application-level change was converting the highest-traffic pages from Server-Side Rendering (SSR) to Static Site Generation (SSG) with Incremental Static Regeneration (ISR).

Before (SSR)

// Every single visitor triggers this — Firestore calls, markdown parsing, SSR
export async function getServerSideProps({ req, res }) {
  const posts = DataService.getPostByPaging(1);
  const sessionKey = await CookieHelper.GetSessionKey(req, res);
  const result = await GetSession(sessionKey); // Firestore call
  const user = result?.user ?? null;
  return { props: { posts, user, sessionKey } };
}

After (ISR)

// Built once at deploy time, re-rendered in background every 5 minutes max
export async function getStaticProps() {
  const posts = DataService.getPostByPaging(1);
  return {
    props: { posts, sessionKey: '', user: null },
    revalidate: 300,
  };
}

Result: The homepage goes from "SSR + 2× Firestore calls per visitor" to "serve pre-built HTML from disk". On a t2.micro this is the difference between 200–400 MB of memory churn per traffic spike and near-zero.

For blog post detail pages, the same approach works with getStaticPaths:

export async function getStaticPaths() {
  const posts = DataService.getBlogPosts();
  return {
    paths: posts.map((p) => ({ params: { id: p.id } })),
    fallback: 'blocking', // new posts render on first hit, then cached
  };
}

fallback: 'blocking' means posts published after the last deploy are still accessible — the first visitor triggers a one-time build, and everyone after gets the cached static version.
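
The matching getStaticProps for the detail page follows the same ISR pattern (a sketch; DataService.getPostById is an assumed helper, not shown above):

export async function getStaticProps({ params }) {
  // Runs at build time (or once on first hit for fallback pages), never per visitor
  const post = DataService.getPostById(params.id); // assumed lookup helper
  if (!post) {
    return { notFound: true }; // serve a 404 instead of crashing on stale IDs
  }
  return {
    props: { post },
    revalidate: 300, // background re-render at most every 5 minutes
  };
}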


Lessons Learned

Mistake                             | Impact                         | Fix
Health check hitting SSR endpoint   | 503 on memory spikes           | Use static /health.html
Single EC2 with no ASG              | Manual recovery after crash    | ASG with health_check_type = "ELB"
sudo su in user_data                | App never started              | Remove it; user_data runs as root
PM2 flags after --                  | Flags ignored by PM2           | Move flags before --
No PM2 startup/save                 | App lost on reboot             | pm2 startup && pm2 save
No Node.js heap cap                 | Uncontrolled OOM crash         | --max-old-space-size=512
SSR with Firestore on every request | Memory pressure, latency       | getStaticProps + ISR
No public IP on launch template     | Bootstrap failed silently      | associate_public_ip_address = true
Grace period too short (300s)       | ASG killed instances mid-build | Increased to 720s

Conclusion

502 and 503 errors on AWS ALB + EC2 setups are almost always one of three things:

  1. The health check target is too expensive — use a static file
  2. The app process crashed and nothing restarted it — use ASG with ELB health checks
  3. The bootstrap script has a subtle bug — test it thoroughly; run it manually once before automating

The combination of a static health check endpoint, an ASG for self-healing, a correct bootstrap script, and converting hot pages to ISR eliminated recurring outages and significantly reduced memory pressure on the instance.

The changes described here apply to any Node.js application hosted on EC2 behind an ALB — not just Next.js blogs.