Deep dive into postgres stats: pg_stat_bgwriter reports

Everything you always wanted to know about Postgres stats

Today I'd like to make a little detour from the main series and dive into pg_stat_bgwriter. If you've been following my previous posts, you will remember that the pg_stat_bgwriter view holds summary statistics about the bgwriter and checkpointer. Here I'd like to show an interesting report query based on pg_stat_bgwriter. The sources of the query were found in the postgres mailing lists, shared by my colleague Victor Yegorov, and slightly modified by me. The report provides comprehensive information about bgwriter and checkpointer activity and helps to configure them better.
A tiny recommendation: run this query with expanded output enabled in psql (\x). The report produces only one row and looks like this:

-[ RECORD 1 ]--------------------------+---------------------------------------
Uptime                                 | 112 days 12:58:21.394156
Since stats reset                      | 132 days 14:37:34.149519
Forced checkpoint ratio (%)            | 7.6
Minutes between checkpoints            | 56.61
Average write time per checkpoint (s)  | 775.36
Average sync time per checkpoint (s)   | 0.21
Total MB written                       | 4194915.7
MB per checkpoint                      | 105.26
Checkpoint MBps                        | 0.03
Bgwriter MBps                          | 0.24
Backend MBps                           | 0.10
Total MBps                             | 0.37
New buffer allocation ratio            | 34.925
Clean by checkpoints (%)               | 8.5
Clean by bgwriter (%)                  | 64.3
Clean by backends (%)                  | 27.2
Bgwriter halt-only length (buffers)    | 0.00
Bgwriter halt ratio (%)                | 0.25
-------------------------------------- | --------------------------------------
checkpoints_timed                      | 3117
checkpoints_req                        | 256
checkpoint_write_time                  | 2615284047
checkpoint_sync_time                   | 716686
buffers_checkpoint                     | 45447168
buffers_clean                          | 345439953
maxwritten_clean                       | 861
buffers_backend                        | 146062083
buffers_backend_fsync                  | 0
buffers_alloc                          | 18752891779
stats_reset                            | 2016-10-12 23:41:35.737168+03
total_checkpoints                      | 3373
total_buffers                          | 536949204
startup                                | 2016-11-02 01:20:48.492531+03
checkpoint_timeout                     | 1h
checkpoint_segments                    | 256
checkpoint_completion_target           | 0.9

The report consists of two parts separated by a horizontal dashed line: the first part is the report itself, and the second contains the raw values from pg_stat_bgwriter plus auxiliary items used in the report.

The first part is more interesting and here is why.

The first values we see are the stats interval and the uptime. The general idea here is that the longer the interval since the last stats reset, the less accurate the report will be. Thus, I'd recommend resetting the stats periodically, e.g. weekly or monthly. The uptime only shows how long postgres has been running. Note that in the example the uptime is less than the stats interval. That isn't bad per se, but for your own report try to use stats collected only after the last postgres startup, because at shutdown postgres runs a forced checkpoint that isn't related to the workload but may skew the stats.
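To see whether your stats predate the last startup, it's enough to compare the two timestamps from the report. A minimal sketch in Python, using the values shown above:

```python
from datetime import datetime, timezone, timedelta

tz = timezone(timedelta(hours=3))
stats_reset = datetime(2016, 10, 12, 23, 41, 35, tzinfo=tz)  # stats_reset from the report
startup     = datetime(2016, 11, 2, 1, 20, 48, tzinfo=tz)    # postgres startup time

# stats_reset predates the last startup, so the stats interval includes
# a previous postgres lifetime (with its forced shutdown checkpoint)
if stats_reset < startup:
    print("stats include a previous postgres run -- consider resetting them")
```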

Next is the information about checkpoints. "Forced checkpoint ratio" is the share of checkpoints that were triggered by xlog (WAL) volume. The current value is 7.6% and that's good enough. High values, for example more than 40%, indicate that xlog-triggered checkpoints occur too frequently. As you might remember, xlog checkpoints are less desirable than timed checkpoints, so the general idea is to reduce their number by increasing the amount of WAL required to trigger a checkpoint. "Minutes between checkpoints" is the average interval between checkpoints. When everything is OK, this value should be close to checkpoint_timeout; values significantly lower than checkpoint_timeout also point at frequent xlog checkpoints. The general recommendation in both cases is to raise max_wal_size (or checkpoint_segments on 9.4 and older).
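Both checkpoint metrics are derived from the raw counters shown in the second part of the report. A quick Python re-derivation using the numbers above:

```python
# raw values from the second part of the report
checkpoints_timed = 3117
checkpoints_req = 256
seconds_since_reset = 132 * 86400 + 14 * 3600 + 37 * 60 + 34  # 132 days 14:37:34

total_checkpoints = checkpoints_timed + checkpoints_req        # 3373
forced_ratio = 100.0 * checkpoints_req / total_checkpoints     # checkpoints forced by xlog
minutes_between = seconds_since_reset / 60.0 / total_checkpoints

print(round(forced_ratio, 1))     # 7.6
print(round(minutes_between, 2))  # 56.61
```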

Next come the average write and sync times in seconds. The write time is near 13 minutes, and that's a good value: it shows that the write stage of checkpoints finished quickly enough despite the long interval between checkpoints. Values close to the interval between checkpoints aren't good -- they indicate that storage spends too much time writing buffers out. The average sync time should be near zero; values far from zero indicate low storage performance.
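These averages come straight from the cumulative checkpoint_write_time and checkpoint_sync_time counters, which pg_stat_bgwriter keeps in milliseconds:

```python
# cumulative counters from pg_stat_bgwriter, in milliseconds
checkpoint_write_time = 2615284047
checkpoint_sync_time = 716686
total_checkpoints = 3117 + 256  # timed + requested

avg_write_s = checkpoint_write_time / total_checkpoints / 1000.0
avg_sync_s = checkpoint_sync_time / total_checkpoints / 1000.0

print(round(avg_write_s, 2))  # 775.36 -- about 13 minutes
print(round(avg_sync_s, 2))   # 0.21
```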

The next items are informational: they show the average throughput of the checkpointer, bgwriter and backends, and give us additional insight into the workload. "Total MB written" is obviously the amount of data written by all three sub-processes. "MB per checkpoint" is the average amount written per checkpoint. The following values are measured in MBps and describe per-process throughput. In the example above the values are low, less than 1 MBps, which means the server doesn't produce a huge amount of dirty data -- or perhaps the report was built over a long stats interval and the throughput was averaged out.

"New buffer allocation ratio" field is the ratio of new allocated buffers to all written buffers. When backends handle data, firstly they check are data already in shared buffers area? If there are no required data in shared buffers, backends allocate new buffers and load data from main storage to shared buffers and then process it (see details in BufferAlloc() and StrategyGetBuffer() functions). Thus, high number here tell us that backends allocated a lot of buffers since required data didn't exist among shared buffers.

There are two possible reasons for this. The first is that backends read rarely used "cold" data, such as old archived partitions; the second is that previously used data had been evicted from shared buffers for lack of space. Moreover, this number shows how many times more data were read into shared buffers than were written out of them. This metric pairs well with the cache hit ratio: a high allocation ratio combined with a low cache hit ratio can indicate insufficient shared buffers, though it's hard to know for sure.
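The ratio itself is simply buffers_alloc divided by the sum of all written buffers:

```python
buffers_alloc = 18752891779
# all written buffers: checkpoint + bgwriter + backend
total_buffers = 45447168 + 345439953 + 146062083

# how many buffers were newly allocated per buffer written out
alloc_ratio = buffers_alloc / total_buffers
print(round(alloc_ratio, 3))  # 34.925
```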

The next set of values shows what percentage of buffers was cleaned by each sub-process. A high "Clean by checkpoints" value is a sign of a write-intensive workload. A high "Clean by bgwriter" value points at a read workload. A high "Clean by backends" value is a sign that the backends did a lot of the bgwriter's work, and that's not good: values above 50% indicate an ineffective bgwriter configuration, and in this case I would suggest making it more aggressive.
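These three percentages are each sub-process's share of the total written buffers:

```python
buffers_checkpoint = 45447168
buffers_clean = 345439953     # written by bgwriter
buffers_backend = 146062083
total = buffers_checkpoint + buffers_clean + buffers_backend

print(round(100.0 * buffers_checkpoint / total, 1))  # 8.5  -- clean by checkpoints
print(round(100.0 * buffers_clean / total, 1))       # 64.3 -- clean by bgwriter
print(round(100.0 * buffers_backend / total, 1))     # 27.2 -- clean by backends
```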

Next values are "Bgwriter halt-only length" and "Bgwriter halt ratio". They are about frequency with which bgwriter was delayed due to exceeded bgwriter_lru_maxpages. The values in our example are perfect and high values conversely indicate that bgwriter went to sleep mode too frequently and didn't do its work fast enough. In this case, I'd recommend to configure bgwriter in a more aggressive way -- decrease delay and increase maxpages parameters.

The second part of the report contains the raw values from pg_stat_bgwriter and the configuration parameters related to the bgwriter and checkpointer. They are used in the report's query, so you don't need separate queries to look them up.

Here I prepared a few reports with my comments from different pgbench workloads:
  • here is eight hours of read-only workload with 4GB shared buffers (default bgwriter)
  • here is eight hours of read-only workload with 8GB shared buffers (default bgwriter)
  • here is eight hours of read/write workload with 4GB shared buffers (default bgwriter)
  • here is eight hours of read/write workload with 8GB shared buffers (default bgwriter)
  • here is eight hours of read/write workload with 8GB shared buffers (aggressive bgwriter)
Tests were performed on a server with 16 CPUs, 32GB RAM, RAID1 on 2xSSD (datadir) and RAID1 on 2xSAS (wal), running PostgreSQL 9.6.2; the test database size is 96GB.

That's all for this time. I hope you enjoyed this post and found it helpful.


  1. Alexey, thanks for this great explanation!
I noticed that in the read/write workload reports with default bgwriter settings, changing only shared_buffers from 4GB to 8GB reduces the fsync time from 8ms to 1.85ms. Why does one affect the other?

1. Yep, there are no changes except shared_buffers, and the sync time seemed suspicious to me too. In the end I think the issue lies in the customer's SSDs, and I don't have any other ideas. The way to confirm or refute that is to check the disks with fio and look at the latency distribution.
