Building a CDC for AWS Aurora Postgres

Sigal Shaharabani
Israeli Tech Radar
Published in
5 min readMay 28, 2020

--

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed so that action can be taken using the changed data.

Lately I’ve been working on a CDC (Change Data Capture) system, that tracks changes from Postgres-based databases and sends them to another destination.

This is a very common scenario for a CDC — syncing data between databases. Thus my research started with checking what other, including Debezium, use to capture data changes from Postgres, which resulted with WAL.

Write Ahead Logs (aka WAL), are the transaction logs of Postgres. Briefly, data changes are written only after they have been logged. This means that by following the stream of data in the WAL, you know about any data change in your tables (except truncate) and can take action. Due to some business and runtime restrictions, I chose to write the WAL listener on my own in Go.

However, when dealing with AWS managed databases I dealt with gaps in the available materials, and I hope to bridge these gaps in this post.

Lets bridge the gap
Photo by Farnoosh Abdollahi on Unsplash

Let’s start with describing the planned solution: Use the pub-sub model, the Postgres would publish data changes to one or more subscribers.

Postgres WAL is processes by a WAL parser
My basic idea

The reasons for choosing the solution are:

  • Monitoring data changes based on WAL is more reliable
  • It is fairly easy to get started

Creating a Postgres change listener

The instructions of starting to follow WAL are simple:

  1. Enable Logical replication on your Postgres
  2. Create a replication slot
  3. Optionally add wal2json to receive new data as JSon messages
  4. Connect to the database with the url option replication=database
  5. Start consuming your data

Since I am writing a CDC in Go I wrote a WAL consumer similar to pg_listener (I used newer versions of the Postgres connection libraries).

The WALs are decoded into the replication slot, wal2json transforms the data to JSon which is then parsed
The complete solution

At this point I was pretty happy with myself, I was able to quickly create an application that listens to Postgres changes and I could now attend to the data at its destination.

However, AWS RDS

A CDC is written to transmit data changes from a source to a destination, so it is not developed in a vacuum, the source and destination systems have their own architecture that I had to adapt to.

In some of the setups the source Postgres can be either:

  • AWS RDS Postgres
  • AWS RDS Aurora

AWS Relational Database Service (RDS) is a managed relational database service, one of the optional database engines to choose from is Postgres. It provides resizable capacity while automating time-consuming administration tasks.

AWS RDS supports Aurora, Postgres, MySQL, MariaDB, Oracle and SQL Server
RDS Supported engines

AWS RDS Aurora is a MySQL and Postgres compatible relational database built for the cloud, it gives services such as scale, fault tolerance and more. Amazon Aurora is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases.

Reading many blog posts, discussions and guides I could understand that my code should not change, but I wasn’t quite sure how to make the RDS databases replicated data be available to the listener.

RDS Postgres

I found a blog post in the AWS website, showing how the technique I described in the beginning of this post can be used.

RDS Aurora

While looking for materials about RDS Aurora Postgres and its logical replication I came up with few results, so I had to rely entirely on the following guide.

The first 2 steps are:

  1. Enable logical replication
  2. Create a publication for multiple tables (one, list or all tables are supported)

However, the next step was not clear. The guide stated:

For this example, table data is replicated from an Aurora PostgreSQL database as the publisher to a PostgreSQL database as the subscriber. Note that a subscriber database can be an RDS PostgreSQL database or an Aurora PostgreSQL database. A subscriber can also be an application that uses PosgreSQL logical replication. After the logical replication mechanism is set up, changes on the publisher are continually sent to the subscriber as they occur.

From this I understood that I do not need to create a Postgres subscriber. However, since creating the subscriber also results in a replication slot, I still had to do the following on my own:

  1. Create a replication slot
  2. Add Wal2Json (because the code was already based on JSon messages)

The following command solves both problems, run it in your favourite Postgres client:

SELECT * FROM pg_create_logical_replication_slot(<slot name>, 'wal2json')

Yes, both RDS Postgres and RDS Aurora permit you to start using wal2json immediately.

Notes

  1. Publications and Subscriptions were introduced in Postgres 10, and are available for Aurora as of version 2.2.0
  2. At this point I had to change my original Go code because in that code the listener specified the list of tables to listen to

At this point I started the Go application, and data started to flow. Success!

The publication monitors the WAL for changes in the configured tables
Solution for Postgres, RDS Postgres and RDS Aurora

Though I had some trouble getting started with AWS RDS, IMHO, basing the CDC on WAL remains a more reliable solution than writing my own system to detect data changes in the database.

I do recommend spending time reading about:

  • Logical Replication
  • WAL2JSON if you are planning to use it (and I do recommend using it)
  • Postgres publications and subscriptions

Who I am

My name is Sigal Shaharabani — I am a data engineering technical leader in Tikal.

--

--