
Disable backups or provide a way to limit uncontrolled growth of wal directory #2813


Closed
pietrobaricco opened this issue Oct 28, 2021 · 2 comments


pietrobaricco commented Oct 28, 2021

Overview

I have a development cluster which crashed because the disk got filled by WAL files.

Use Case

Development environments, where data can be rapidly restored from a dump and is not important anyway, don't need a backup system at all.

Desired Behavior

  • Being able to disable pgBackRest, or whatever is causing this uncontrolled growth of the WAL directory.
  • Clear instructions on how to recover from such situations, clearing the WAL files for good.

Environment

Tell us about your environment:

Please provide the following details:

  • Platform: Kubernetes
  • Platform Version: 5.0.3
  • PGO Image Tag: ubi8-5.0.3-0
  • Postgres Version: 13
  • Storage: native pvc
  • Number of Postgres clusters: 1

Edit: the Postgres logs are full of lines like this:

ERROR: [099]: raised from remote-0 protocol on 'spadapgo-repo-host-0.spadapgo-pods.default.svc.cluster.local.': expected '{' at 'BRBLOCK-1'
2021-10-26 15:31:01.121 UTC [128] LOG: archive command failed with exit code 99
2021-10-26 15:31:01.121 UTC [128] DETAIL: The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000002000000070000003B"
I tried to run the archive command manually from the database container; it did not produce errors, but the WAL file was still there.

In the end, I cleared the folder with pg_archivecleanup and the server resumed.
It may be worth mentioning that the issue first manifested after I enabled a cron job that refreshes ~20 materialized views from FDW tables every 30 minutes.


jkatz commented Oct 28, 2021

For managing the size of the backup repository, please see Managing Backup Retention.

pgBackRest also lets you set how much WAL archive you wish to retain:

https://pgbackrest.org/configuration.html#section-repository/option-repo-retention-archive
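As a sketch of how that option could be combined with the retention settings already discussed in this thread (the values below are illustrative, not recommendations for this cluster):

```yaml
spec:
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count
        # Expire archived WAL that is no longer needed by the retained
        # full backups (illustrative values; see the pgBackRest docs above).
        repo1-retention-archive: "1"
        repo1-retention-archive-type: full
```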

For disabling backups and more on this discussion, please see #2531

@jkatz jkatz closed this as completed Oct 28, 2021

pietrobaricco commented Oct 29, 2021

Thanks for your input. I had already seen the linked discussion; I asked again because it seems people are still stuck on these issues. I certainly am: yesterday I set up pgBackRest to perform one full backup every 4 hours and to retain only one backup, hoping this would clear the WALs (as suggested by another poor soul here):

backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-2
      repoHost:
        dedicated: {}
      repos:
        - name: repo1
          schedules:
            full: "0 */4 * * *"
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 20Gi
      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count

This morning I have several failed pgBackRest pods (all failed with "no space left on device"), the pg_wal folder is again hopelessly growing, and there is no way to run a manual expire because pgBackRest can't even create the temp files it needs.

The database is much smaller than 20 GB: the /pgdata folder, once all the stuck WALs were cleared, is only 2 GB, and 99% of it is data in matviews.
What would be a reasonable pgBackRest repository size for a situation like this (a ~2 GB database, most of it refreshed every 30 minutes, no need for backups, 1 replica)?
Many thanks

Edit:

      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count
        archive-push-queue-max: 5G

It seems that adding archive-push-queue-max more or less solves the issue. At least now I see Postgres dropping the WALs:

WARN: dropped WAL file '0000000400000036000000EC' because archive queue exceeded 5GB
WARN: dropped WAL file '0000000400000036000000ED' because archive queue exceeded 5GB

During my tests with pgBackRest I've encountered some seemingly random errors, like

ERROR: [047]: unable to create path '/var/spool/pgbackrest/archive': [30] Read-only file system

which occurs when enabling archive-async,

or a mysterious error 82, described here. Note that in both cases I had plenty of disk space on both the Postgres and repo containers.
To fix error 82, I had to restart the pg cluster, as suggested in the linked issue.
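For the read-only /var/spool/pgbackrest error with archive-async, one possible workaround (a sketch, not something verified on this cluster) is to point pgBackRest's spool directory at a path that is writable inside the Postgres container, for example under /pgdata:

```yaml
      global:
        archive-async: "y"
        # spool-path must be writable in the Postgres container; the default
        # /var/spool/pgbackrest is read-only in this image. The path below is
        # an assumption, not a PGO-documented location.
        spool-path: /pgdata/pgbackrest-spool
```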
