pg_wal is too big… what's going on?

What do a DBA and a detective have in common? They both solve murders and mysteries while trying to make sense of the nonsense.

Recently, one of our clients had Postgres crash due to a lack of free space on the database server. The client's reaction was instantaneous: a hot standby was immediately promoted to master and the application continued working. Some details about the crash were saved, and since this is the age of clouds, DevOps and microservices, the failed host was recycled into a new hot standby.

Once the dust had settled, the client gave us the saved information about the failure and asked: What was it? How can I avoid it in the future?

The first thing we did is what any good detective would do - we started looking for clues. The client provided us with the crash logs and said that the pg_xlog directory had been suspiciously big, holding around two thousand files of 16MB each. We looked into postgresql.log to see the last lines before the crash:

PANIC: could not write to file "pg_xlog/xlogtemp.9023": No space left on device
LOG: WAL writer process (PID 9023) was terminated by signal 6: Aborted
LOG: terminating any other active server processes


Looking at other log messages we saw another FATAL record:

FATAL: archive command failed with exit code 131
DETAIL: The failed archive command was: /opt/utils/pgdb/wal-archiver pg_xlog/000000010000000000000003


That was the answer to the first question. From these two log messages we concluded that free space had been completely exhausted because of the failing archive command - Postgres could not write its transaction log and ultimately crashed.

But how can we prevent that from happening again, and what should we do if it does happen?
Let me take a little detour here for those who are unfamiliar with pg_xlog. pg_xlog is the directory where Postgres keeps its transaction log (also known as XLOG or WAL), which is used for recovery purposes. People often delete files from this directory simply because its name contains the word "log", and one is inclined to think: "These are just logs, nothing will happen if I delete them". In fact, pg_xlog plays a vital role in the database's life! To prevent situations where these files are accidentally deleted, pg_xlog was renamed to pg_wal in Postgres 10. Note: in this article I will use the newer term 'pg_wal' for both 'pg_xlog' and 'pg_wal'.

Now, let's get back to our mystery, how can we avoid being left without space on our device?

The most obvious answer here is monitoring. The amount of free (or used) disk space must always be kept under control, and there has to be a system that alerts you when a threshold is reached. This lets you start working on an issue before it becomes a real problem. If there is no monitoring system, disk space can be checked with the "df" utility.
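As an illustration, a minimal check of this kind could look as follows - a sketch assuming a Linux host with GNU coreutils; the threshold and mount point are placeholder values to adjust for your setup:

```shell
#!/bin/sh
# Minimal disk-space alert sketch: warn when a filesystem
# passes a usage threshold. THRESHOLD and MOUNT are placeholders.
THRESHOLD=90
MOUNT=/

# df --output=pcent prints the used percentage; strip everything
# except the digits to get a bare number.
used=$(df --output=pcent "$MOUNT" | tail -n 1 | tr -dc '0-9')

if [ "$used" -ge "$THRESHOLD" ]; then
    echo "WARNING: $MOUNT is ${used}% full"
fi
```

A script like this can be run from cron or wired into whatever alerting channel you already have.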

Let's imagine that, through the monitoring system or otherwise, you get an alert that free space is approaching a dangerous threshold, and the cause is sudden, uncontrolled growth of the pg_wal directory. The first golden rule here is: "Do NOT delete any files from the pg_wal directory!" - otherwise your database might turn into a pumpkin.

There are two ways to go about it. The first is emergency action - the last resort when only 2% or less of disk space is left, there is no time to find the cause of the problem, and the main goal is to avoid a crash halt of the database. The second is a set of actions aimed at finding the cause of the space consumption - something you can only do when there is no risk of an emergency stop. Let's consider both.

When there is no free disk space left and time is working against us, the main goal is to avoid a complete halt of the database. Intuitively, the most obvious action is to delete something. This is, however, not the best practice: especially if you are in a hurry, chances are you will cause irreversible damage by deleting essential database files. In this situation it is much better to add space instead. This is not an easy solution and has its limitations, but it is far safer than random deletion. So where can we get that extra space? There are two options, and both depend on your OS installation settings:

Reserved space on Ext filesystems.
By default, Ext filesystems are formatted with about 5% of space reserved for the super-user - this allows the super-user and root-owned daemons to continue functioning correctly after non-privileged processes are prevented from writing to the filesystem. The reserved space is invisible to ordinary users, and with 'tune2fs -m' it is possible to reduce the percentage, returning that space from root-only to shared usage. The larger your filesystem, the more noticeable the effect of such an action. Once the root cause of the problem has been fixed, the reserved percentage can be restored. Keep in mind that XFS, an alternative filesystem, has no reserved-space option, so this trick won't work there.
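The commands involved might look like this - a sketch, where /dev/sdb1 is a hypothetical device name; substitute the device backing your data directory:

```shell
# Check how many blocks are currently reserved
tune2fs -l /dev/sdb1 | grep -i 'reserved block count'

# Temporarily cut the reserved space from the default 5% down to 1%
tune2fs -m 1 /dev/sdb1

# ...fix the underlying problem, then restore the default
tune2fs -m 5 /dev/sdb1
```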

Free space in the pool or the volume group.
If the database is hosted on LVM or ZFS, a different emergency action may be useful. If the pool or the volume group has free space, that space can be added to the filesystem. Later on, when the problem has finally been solved, the added space can be returned to the pool - to do that, you need to shrink the filesystem and then reduce the volume size. This approach is a bit trickier than the one above, but it is definitely an option. It is something I would advise only for experienced DBAs, so if you want to take this route, it is highly advisable to test your skills in a test environment first, so that you are prepared to do it live when needed.
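On LVM, the growing step might look like this - a sketch, where the volume group "pgvg" and logical volume "pgdata" are hypothetical names:

```shell
# Check whether the volume group has free extents
vgs pgvg

# Grow the logical volume by 5G and resize the filesystem on it in one step
lvextend -r -L +5G /dev/pgvg/pgdata
```

Returning the space afterwards (shrinking the filesystem, then the volume) is the risky direction - ext4, for example, can only be shrunk while unmounted - which is exactly why a dry run in a test environment is so strongly advised.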

After the emergency actions are complete and there is no risk of a panic shutdown of the database, you can start investigating the source of your problems: why did pg_wal start bloating in the first place? Here are the two most popular scenarios.

Failed archive_command.
WAL archiving is configured with the archive_mode and archive_command parameters, and the latter has one important property: if archive_command fails with any error (a non-zero exit code), Postgres keeps the unarchived segment and retries until the command succeeds. Depending on the WAL traffic, the pg_wal directory then starts to bloat, and the only way to stop it is to fix archive_command. To detect such problems, use the pg_stat_archiver view, which reports the number of failed archiving attempts and the exact time of the last failure. If you use WAL archiving with archive_command, monitoring pg_stat_archiver stats is highly recommended.
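For example, archiver health can be checked from the shell like this - a sketch, assuming psql can connect with your usual credentials:

```shell
# Number of archived segments, number of failed attempts,
# and the last failed segment and failure time
psql -x -c "SELECT archived_count, last_archived_wal,
                   failed_count, last_failed_wal, last_failed_time
            FROM pg_stat_archiver;"
```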

Another way to shoot yourself in the foot is half-configured WAL archiving on a standby host. archive_command runs on a standby only when: 1) 'archive_mode = always' is set; or 2) the standby is promoted to master. Standby hosts are therefore usually configured with 'archive_mode = on', and that is fine - until one fine day the standby is promoted to master and WAL archiving becomes enabled. Experience shows that at this point the required archiving scripts or packages are often absent on the freshly promoted master, and archive_command fails. So when you use the same configuration files on all nodes of a Postgres cluster, please keep this in mind.
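A postgresql.conf fragment for a standby might look like this (the archiver path is taken from this article's logs; treat it as a placeholder):

```ini
# Archiving runs only after this node is promoted to primary:
archive_mode = on
# Uncomment instead to archive on the standby as well (Postgres 9.5+):
#archive_mode = always
archive_command = '/opt/utils/pgdb/wal-archiver %p'
```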

Forgotten replication slot.
Another cause of pg_wal bloat is a forgotten replication slot. Slots were introduced in Postgres 9.4 and are designed to retain WAL segments for safety purposes when a client subscribed to a slot unexpectedly goes down for a prolonged period of time. Before the introduction of replication slots, wal_keep_segments was used for this purpose. However, that parameter only keeps a fixed number of segments, with the risk that needed segments are recycled (leaving the standby unable to reconnect to the primary server). Replication slots don't have this limitation and keep segments for as long as there is available disk space - which is exactly why they, too, can cause pg_wal bloat.

There are also cases when software that connects to Postgres through a replication slot is retired, but users forget to remove the slot itself. The forgotten slot remains in the system and silently continues to hold on to segments… and you can imagine the rest. The pg_replication_slots view lets you observe replication slots and their state, and is very handy whenever slots are in use.
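Here is a sketch of how to inspect slots and, if needed, drop one ('stale_slot' is a hypothetical name; the functions shown use the Postgres 10+ names - on 9.x they are pg_current_xlog_location and pg_xlog_location_diff):

```shell
# List slots and how much WAL each one is holding back
psql -c "SELECT slot_name, active,
                pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                               restart_lsn)) AS retained_wal
         FROM pg_replication_slots;"

# Drop a slot that is no longer needed
psql -c "SELECT pg_drop_replication_slot('stale_slot');"
```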

In the cases above and similar ones, once you have fixed the problem you should (!) connect to Postgres and run the CHECKPOINT command two or three times. This command creates a recovery point, and while doing so Postgres removes the no-longer-needed segments from pg_wal. But note that a running checkpoint may cause extra IO and hurt query performance, especially with large shared buffers and heavy write traffic.
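In practice that can be as simple as the following sketch, where $PGDATA is assumed to point at the data directory:

```shell
psql -c "CHECKPOINT;"
psql -c "CHECKPOINT;"

# Verify that the directory is shrinking
du -sh "$PGDATA/pg_wal"
```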

Of course, these are not the only issues that may cause pg_wal to bloat, but in my practice they are the most frequent ones. If you would like to see more, check the Postgres mailing lists, where you can find additional examples.

Do you have any interesting cases of pg_wal bloat? Share them in comments below!

Comments

  1. I encountered an application recently which had a bug in a nightly cleanup script. It did a big delete but was missing a where clause, which caused it to run into a foreign key error when it tried to do the delete. For whatever reason, the application would loop, retrying the cleanup query for an hour. In that hour, over 8000 WAL files were generated instead of the usual <200. From what I now understand, PostgreSQL checks the foreign key constraints after writing the updated blocks to the WAL, so when it hit the violation it had to write the records back via more WAL updates. As the application continually retried the query for an hour, the server basically generated WAL data as fast as it could for that period of time. The pg_waldump utility can be used to figure out what's going on in this case, as well as looking for errors in the log.

    1. Looks interesting, I'd like to reproduce that, do you have a simple test case?

