Preventing WAL growth on Postgres DB running on AWS RDS

Software Engineering
September 12, 2023

Artie is a fully managed database replication solution. From working with Postgres on AWS RDS, we have noticed that idle or dev databases will occasionally experience replication slot overflow and cause issues with customers. It’s also happened to us while testing out CDC pipelines. We wrote this guide to describe why this issue happens and how to prevent replication slot overflow!

Artie Cloud customers can avoid replication slot overflow for low traffic databases by enabling Heartbeats on the dashboard. For open source users or anyone running Postgres on AWS RDS in general, I hope this guide helps you understand and prevent this problem!

For anyone using Postgres on AWS RDS, it’s possible you’ve run into the scenario of an idle or unused dev database suddenly running out of memory and causing replication slot overflow. All of a sudden you’re up in the middle of the night trying to debug the issue and understand why your database is down.

What is causing replication slots to overflow?

When doing anything related to CDC (change data capture), a replication slot in Postgres is created. For Postgres specifically, CDC events are recorded and stored in the database’s WAL (write-ahead logging). WAL exists to ensure data integrity and log the records describing the changes. It’s also helpful for downstream applications to subscribe to and consume CDC events reliably.

Replication slot overflow happens when the WAL accumulates and grows, consumes all your database’s memory, and causes your database to go down. But, why does this happen to idle/unused databases?

The underlying reason is due to AWS heartbeats. AWS RDS periodically writes a heartbeat to a table in the database every 5 minutes , whether the database is being actively used or not. These heartbeats sit in the WAL and the default WAL segment size is 64MB. The purpose for AWS writing heartbeats is for various reasons, including to monitor the health of their databases.

For an active database, heartbeats do not cause an issue because the WAL is almost constantly being drained as new CDC logs are processed by downstream consumers. However, for idle databases with no CDC logs, heartbeats result in WAL accumulating by 64MB every 5 minutes, or 18.4GB per day! Heartbeats are not observed by most CDC applications, so it doesn’t get processed. After enough time has passed, the replication slot overflows and your database goes down.

How do we prevent WAL growth for a PostgreSQL database running on AWS RDS?

If you are using Debezium to read CDC logs from the WAL, you can turn on the heartbeats feature. You have to first create a heartbeat_table in the idle/low traffic database and Debezium will periodically ping the database with an update, which will create a CDC event and prevent WAL growth.

CREATE TABLE test_heartbeat_table (id text PRIMARY KEY, ts timestamp);

When heartbeats are enabled, Artie will then periodically issue this command to advance the replication slot.

UPDATE test_heartbeat_table set ts = now() where id = '1';

Other best practices to consider

  • Monitor your instance for free_storage_space and set up alerts
  • Enable storage autoscaling
  • Set statement_timeout and idle_in_transaction_session_timeout to prevent long-running queries
  • Configure max_slot_wal_keep_size to prevent replication slot overflow.
  • Each file size is 64mb, so if you want to keep 1GB, then you should set this to be 16 .
  • If the replication slot size is reached, the slot will be automatically dropped to protect the database from crashing.

Note that heartbeats is only necessary for low traffic and idle databases (dev or test) and not necessary for active databases.

Table of contents
    Author
    Jacqueline Cheong
    Co-founder & CEO