Question
How to troubleshoot a stuck replication slot?
Environment
Recommendations
A PostgreSQL replication slot may be stuck due to one of the following reasons:

- The replication slot didn’t return any updates from the Write-Ahead Log (WAL) in the past three or more extract cycles.
- The size of the replication slot is in the hundreds of GB, approaching 1 TB.
- The size of the replication slot is often reaching the `max_wal_size` limit.
  NOTE: If you are using PostgreSQL version 13, compare the slot size to the `max_slot_wal_keep_size` limit.
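One way to check a slot against these thresholds is to ask the server how much WAL each slot is holding back. The query below is a minimal sketch, assuming PostgreSQL 10 or later (on 9.6 and below, use `pg_current_xlog_location()` and `pg_xlog_location_diff()` instead):

```sql
-- Sketch: how much WAL each replication slot is currently retaining.
-- On PostgreSQL 13+, the wal_status and safe_wal_size columns of this view are also worth checking.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```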
We recommend the following tuning for the `max_wal_size` or `max_slot_wal_keep_size` parameters. Review these recommendations with your DBA:
- Check if the `bgwriter` (background writer) process is healthy by inspecting its stats. Execute the following query: `SELECT * FROM pg_stat_bgwriter;`
  - If the `maxwritten_clean` value is excessively high for your environment, increase the `bgwriter_lru_maxpages` parameter. In a well-tuned environment, the `maxwritten_clean` value should be 0.
  - If the `buffers_backend` value is larger than the `buffers_clean` value, increase the `bgwriter_lru_multiplier` parameter and decrease the `bgwriter_delay` parameter.
    NOTE: The above condition may also indicate insufficient shared buffers; the hot part of your data is forced to travel between RAM and disk.
  - The `buffers_backend_fsync` value indicates whether the backend was forced to make its own `fsync` requests to synchronize the buffers with storage. A value above 0 points to problems with the storage when the `fsync` queue is completely filled.
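The checks above only need a handful of counters from `pg_stat_bgwriter`. A minimal sketch that pulls just those columns (names as they exist up to PostgreSQL 16; from version 17 onward the `buffers_backend` counters move to `pg_stat_io`):

```sql
-- Sketch: only the counters referenced in the checks above.
SELECT maxwritten_clean,       -- should stay near 0 in a well-tuned environment
       buffers_clean,
       buffers_backend,        -- compare against buffers_clean
       buffers_backend_fsync   -- anything above 0 hints at storage problems
FROM pg_stat_bgwriter;
```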
- If you change the values of any of the above parameters, reset the `bgwriter` stats. Execute the following query: `SELECT pg_stat_reset_shared('bgwriter');` Check the stats again the next day.
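The sketch below shows one way the parameter changes and the stats reset could be applied together. The numbers are illustrative only; agree on actual values with your DBA. All three `bgwriter` parameters are reloadable, so no restart is required:

```sql
-- Illustrative values only; tune them for your environment.
ALTER SYSTEM SET bgwriter_lru_maxpages = 400;    -- raise if maxwritten_clean keeps growing
ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;  -- raise if buffers_backend exceeds buffers_clean
ALTER SYSTEM SET bgwriter_delay = '100ms';       -- lower so the background writer runs more often
SELECT pg_reload_conf();

-- Reset the counters so the next day's reading reflects the new settings.
SELECT pg_stat_reset_shared('bgwriter');
```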
- Check if the `max_wal_size` value is high enough to be rarely reached within your defined `checkpoint_timeout` window. A worked example of the calculation follows the version-specific steps below.
  - For PostgreSQL versions 9.6 and below, do the following:
    1. Get LSN1. Execute: `postgres=# SELECT pg_current_xlog_insert_location();`
    2. Wait for the length of the `checkpoint_timeout` value.
    3. Get LSN2. Execute: `postgres=# SELECT pg_current_xlog_insert_location();`
    4. Get the total amount of WAL written during the checkpoint period. Execute: `postgres=# SELECT pg_xlog_location_diff('LSN2_VALUE', 'LSN1_VALUE');`
    5. Multiply the result from Step 4 by three.
    6. Collect a few data points. Repeat Steps 1 to 5 during a period of heavy database activity.
    7. Update `max_wal_size` to the result of Step 5.
  - For PostgreSQL versions 10 and above, do the following:
    1. Get LSN1. Execute: `postgres=# SELECT pg_current_wal_insert_lsn();`
    2. Wait for the length of the `checkpoint_timeout` value.
    3. Get LSN2. Execute: `postgres=# SELECT pg_current_wal_insert_lsn();`
    4. Get the total amount of WAL written during the checkpoint period. Execute: `postgres=# SELECT pg_wal_lsn_diff('LSN2_VALUE', 'LSN1_VALUE');`
    5. Multiply the result from Step 4 by three.
    6. Collect a few data points. Repeat Steps 1 to 5 during a period of heavy database activity.
    7. Update `max_wal_size` to the result of Step 5.
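As an illustration of Steps 4, 5, and 7, the sketch below uses the PostgreSQL 10+ function names (substitute `pg_xlog_location_diff()` on 9.6 and below). The LSN literals are placeholders for the values captured in Steps 1 and 3, and the `16GB` setting is only an example of what the calculation might yield:

```sql
-- Steps 4-5: three times the WAL written between the two sampled LSNs.
-- 'LSN1_VALUE' and 'LSN2_VALUE' are placeholders for the values from Steps 1 and 3.
SELECT pg_size_pretty(3 * pg_wal_lsn_diff('LSN2_VALUE', 'LSN1_VALUE')) AS suggested_max_wal_size;

-- Step 7: apply the value you settled on (16GB is only an example).
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();  -- max_wal_size is reloadable, no restart needed
```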
- Increase `wal_buffers`. On very busy, high-core machines, it is useful to increase the `wal_buffers` value to as high as 128 MB. Consult your DBA before increasing this value because the buffer value depends on the database’s hardware capacity. Fivetran recommends a minimum of 16 MB.
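A minimal sketch of raising `wal_buffers`, assuming a self-managed server where you can run `ALTER SYSTEM` (on managed services, change the parameter group instead). The 64 MB value is only an example; `wal_buffers` can only change at server start, so a restart is required:

```sql
-- Example value only; agree on the figure with your DBA.
ALTER SYSTEM SET wal_buffers = '64MB';
-- Takes effect after the next server restart; confirm afterwards with:
SHOW wal_buffers;
```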
- For PostgreSQL versions 13 and above, if the `max_slot_wal_keep_size` parameter is not at the default (-1) value, increase the value to ensure two days of WAL retention.
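A sketch of checking and raising `max_slot_wal_keep_size` on PostgreSQL 13+. The `200GB` figure is a placeholder; size it to cover roughly two days of WAL generation on your host:

```sql
SHOW max_slot_wal_keep_size;                       -- -1 means unlimited retention
ALTER SYSTEM SET max_slot_wal_keep_size = '200GB'; -- placeholder; size for ~2 days of WAL
SELECT pg_reload_conf();                           -- reloadable, no restart needed
```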
- If the above steps don’t resolve the stuck replication slot, do the following (a SQL sketch of the slot-related steps follows this list):
  1. Pause the connector.
  2. Drop the replication slot.
  3. Wait for a checkpoint (dependent on the `checkpoint_timeout` value).
  4. Recreate the replication slot.
  5. Ensure replication is in place. Execute: `SELECT * FROM pg_logical_slot_peek_changes('slot_name', NULL, 1);`
     TIP: If the query fails, it indicates that replication is not in place.
  6. Go to the Setup tab of the connector, and click the Re-sync All Historical Data link.
  7. Unpause the connector.
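The following sketch covers the slot-related steps (2, 4, and 5), assuming a logical replication slot. The `slot_name` value and the `test_decoding` output plugin are placeholders; read the real values from `pg_replication_slots` before dropping the slot so you can recreate it exactly as it was:

```sql
-- Note the slot's name and output plugin before dropping it (placeholders below).
SELECT slot_name, plugin, active FROM pg_replication_slots;

SELECT pg_drop_replication_slot('slot_name');                            -- step 2 (slot must be inactive)
SELECT pg_create_logical_replication_slot('slot_name', 'test_decoding'); -- step 4
SELECT * FROM pg_logical_slot_peek_changes('slot_name', NULL, 1);        -- step 5
```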