
Disable backups or provide a way to limit uncontrolled growth of wal directory #2813


Closed
pietrobaricco opened this issue Oct 28, 2021 · 2 comments


pietrobaricco commented Oct 28, 2021

Overview

I have a development cluster which crashed because the disk got filled by WAL files.

Use Case

Development environments, where data can be rapidly restored from a dump and is not important anyway, don't need a backup system at all.

Desired Behavior

  • Being able to disable pgBackRest, or whatever is causing this uncontrolled growth of the WAL directory.
  • Clear instructions on how to recover from such situations, clearing the WAL files for good.

Environment

Tell us about your environment:

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: 5.0.3
  • PGO Image Tag: ubi8-5.0.3-0
  • Postgres Version: 13
  • Storage: native pvc
  • Number of Postgres clusters: 1

Edit: the Postgres logs are full of lines like this:

ERROR: [099]: raised from remote-0 protocol on 'spadapgo-repo-host-0.spadapgo-pods.default.svc.cluster.local.': expected '{' at 'BRBLOCK-1'
2021-10-26 15:31:01.121 UTC [128] LOG: archive command failed with exit code 99
2021-10-26 15:31:01.121 UTC [128] DETAIL: The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000002000000070000003B"
I tried to run the archive command manually from the database container; it did not produce errors, but the WAL file was still there.

In the end, I cleared the folder with pg_archivecleanup and the server resumed.
It may be worth mentioning that the issue first manifested after I enabled a cron job that refreshes ~20 materialized views from FDW tables every 30 minutes.


jkatz commented Oct 28, 2021

For managing the size of the backup repository, please see Managing Backup Retention.

pgBackRest also lets you set how much WAL archive you wish to retain:

https://pgbackrest.org/configuration.html#section-repository/option-repo-retention-archive
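As a sketch of how that option could be combined with the retention settings already discussed in this thread (the values below are illustrative, not recommendations for this cluster):

```yaml
spec:
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count
        # Expire archived WAL that is no longer needed by the retained
        # full backups (illustrative values; see the pgBackRest docs above).
        repo1-retention-archive: "1"
        repo1-retention-archive-type: full
```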

For disabling backups and more on this discussion, please see #2531

@jkatz jkatz closed this as completed Oct 28, 2021

pietrobaricco commented Oct 29, 2021

Thanks for your input. I had already seen the linked discussion; I asked again because it seems people are still stuck on these issues. I certainly am: yesterday I set up pgBackRest to perform one full backup every 4 hours and to retain only one backup, hoping this would clear the WALs (as suggested by another poor soul here):

backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-2
      repoHost:
        dedicated: {}
      repos:
        - name: repo1
          schedules:
            full: "0 */4 * * *"
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 20Gi
      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count

This morning I have several failed pgBackRest pods (all failed with "no space left on device"), the pg_wal folder is again hopelessly growing, and there is no way to run a manual expire because pgBackRest can't even create the temp files it needs.

The database is much smaller than 20 GB: the /pgdata folder, once all the stuck WALs were cleared, is only 2 GB, and 99% of it is data in matviews.
What would be a reasonable pgBackRest repository size for a situation like this (a ~2 GB database, most of it refreshed every 30 minutes, no need for backups, 1 replica)?
Many thanks

Edit:

      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count
        archive-push-queue-max: 5G

It seems that adding archive-push-queue-max more or less solves the issue. At least now I see Postgres dropping the WALs:

WARN: dropped WAL file '0000000400000036000000EC' because archive queue exceeded 5GB
WARN: dropped WAL file '0000000400000036000000ED' because archive queue exceeded 5GB

During my tests with pgBackRest I've encountered some seemingly random errors, like

ERROR: [047]: unable to create path '/var/spool/pgbackrest/archive': [30] Read-only file system

which occurs when enabling archive-async,

or a mysterious error 82, described here. Note that in both cases I had plenty of disk space on both the Postgres and repo containers.
To fix error 82, I had to restart the pg cluster, as suggested in the linked issue.
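For the read-only /var/spool/pgbackrest error with archive-async, one possible workaround (a sketch, not something verified on this cluster) is to point pgBackRest's spool directory at a path that is writable inside the Postgres container, for example under /pgdata:

```yaml
      global:
        archive-async: "y"
        # spool-path must be writable in the Postgres container; the default
        # /var/spool/pgbackrest is read-only in this image. The path below is
        # an assumption, not a PGO-documented location.
        spool-path: /pgdata/pgbackrest-spool
```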
