PostgreSQL Performance Meltdown?

Overview


On the 30th of December 2017, the Kernel page-table isolation (PTI) patch was merged
into the mainline Linux kernel. Its main goal is to mitigate the Meltdown bug.
This patch has also been backported into the stable releases by RedHat, Debian, CentOS, Ubuntu and other Linux vendors. Therefore, the next security upgrade of servers running such stable releases will bring in the PTI feature.

Unfortunately, this bugfix comes with a performance penalty.
Andres Freund provided figures showing that performance might drop by 7-23% on TPC-H-like workloads.

The Meltdown vulnerability “allows a rogue process to read any kernel memory, even when it is not authorized to do so”. PTI is designed to close this flaw, i.e. to restrict access to unauthorized memory areas. Of course, this comes at a price, and it would be good to find out what performance impact a Linux kernel upgrade will have. Since PTI affects memory-based operations, one way to measure this is to use a database that fits into the RAM of the server and to perform a series of pgbench-based read-only tests.
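As a quick way to see which variant a given machine is actually running (not part of the original test; the exact dmesg wording and sysfs paths vary between kernel builds), something like this can be used:
# kernel boot parameters in effect (look for pti / nopti / nopcid)
cat /proc/cmdline
# does the CPU expose the PCID feature?
grep -qw pcid /proc/cpuinfo && echo "pcid supported"
# on kernels that expose it, report the Meltdown mitigation status
cat /sys/devices/system/cpu/vulnerabilities/meltdown 2>/dev/null
# the boot log usually states whether page-table isolation was enabled
dmesg | grep -i isolation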

Test setup


CPU: Intel(R) Xeon(R) CPU E5-2430 0 @ 2.20GHz
RAM: 32GB (swap 16GB)
Disks: Dell PERC H710, 120GB EDGE E3 SSD in RAID1 (110GB available)
Linux: CentOS 7.3.1611
PostgreSQL: 10.1

PostgreSQL configuration changes:
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0
checkpoint_timeout = 1h
max_wal_size = 8GB
min_wal_size = 2GB
checkpoint_completion_target = 0.9
log_destination = stderr
logging_collector = on
log_directory = /var/log/postgresql
log_filename = postgresql-%a.log
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_checkpoints = on
log_line_prefix = '%m %p %u@%d from %h [vxid:%v txid:%x] [%i] '
log_lock_waits = on
log_temp_files = 0
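To double-check which of these settings actually took effect after a restart, a quick sanity check along these lines can be used (not part of the original post):
# list every parameter loaded from postgresql.conf / postgresql.auto.conf
psql -AtX pgbench -c "SELECT name, setting, unit FROM pg_settings WHERE source = 'configuration file' ORDER BY name"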

Test details


The test database was prepared with: pgbench -i -s 1700 pgbench
Database size: 26 035 MB
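The reported size can be reproduced with a check like this (illustrative, assuming the database is named pgbench as above):
# total size of the pgbench database
psql -AtX pgbench -c "SELECT pg_size_pretty(pg_database_size('pgbench'))"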

The following kernels and boot options were tested:
  1. Old kernel version: 3.10.0-514.el7.x86_64
  2. New kernel version, as-is: 3.10.0-693.11.6.el7.x86_64-pti-pcid
  3. New kernel, with nopcid: 3.10.0-693.11.6.el7.x86_64-pti-nopcid
  4. New kernel, with nopti: 3.10.0-693.11.6.el7.x86_64-nopti-pcid
  5. New kernel, with nopti nopcid: 3.10.0-693.11.6.el7.x86_64-nopti-nopcid
For each of the 5 “kernels”, 3 shared_buffers settings were tested: 1GB, 8GB and 28GB.
After each restart of the cluster, tables and indexes were pre-warmed via the `pg_prewarm` module. Then a 15-minute read-only pgbench run was conducted.

Script of the test (${SB} is set to 1GB, 8GB or 28GB for each run):
# apply the shared_buffers setting for this run
echo "shared_buffers=${SB}" > /var/lib/pgsql/10/data/postgresql.auto.conf
# restart the cluster to pick the setting up
pg_ctl -D ~/10/data -m fast restart
# pre-warm all tables and indexes of the pgbench database
echo "SELECT format('SELECT %L, pg_prewarm(%L);', relname, relname) FROM pg_class c JOIN pg_namespace n ON n.oid=c.relnamespace AND n.nspname='public' WHERE relkind IN ('r','i')\gexec" |psql -qAtX pgbench
# 15-minute read-only run: 4 threads, 12 clients, prepared statements, per-statement latencies
pgbench -T 900 -j4 -M prepared -c 12 -Sr pgbench

Results


Alias          TPS (1GB)        TPS (8GB)        TPS (28GB)
old            92995            80554            107144
pti pcid       88394 (-4.9%)    77586 (-3.7%)    102839 (-4.0%)
pti nopcid     83216 (-10.5%)   74947 (-7.0%)    98653 (-7.9%)
nopti pcid     90772 (-2.4%)    79120 (-1.8%)    105726 (-1.3%)
nopti nopcid   91387 (-1.7%)    78856 (-2.1%)    105593 (-1.4%)
The test shows that the new PTI-enabled kernel brings a ~4-5% slowdown for in-memory read-only operations. Using the new kernel parameters nopti and nopcid (together), it is possible to disable the new functionality; however, a small performance reduction is still observed.
There are no general directions on whether to disable PTI; after all, its goal is to close a hardware-based vulnerability. In cases when the server is dedicated to the database alone and is a physical machine (not a VM or container), it seems fine to use the nopti parameter and get better performance.
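For reference, on CentOS 7 the parameter could be added roughly like this (a sketch, not part of the original test; adjust to your own boot loader setup and verify after reboot):
# add nopti to the boot entries of all installed kernels
grubby --update-kernel=ALL --args="nopti"
reboot
# after reboot, confirm the flag is actually in effect
cat /proc/cmdline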
Here’s a graph that shows TPS for the old kernel, the new (PTI-enabled) one, and the new kernel with the PTI and PCID features disabled:

For the chosen test case the performance penalty is not huge. And one can clearly see why using 50% of RAM for shared buffers is not recommended ;)

Also, since larger shared buffers yield more TPS and can significantly reduce IO volume (because only checkpoints then write dirty blocks), it seems beneficial to use 70-80% of RAM for this purpose. This requires more tuning on the kernel level, though; HugePages are a must for such setups.
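As a rough sketch of that kernel-level tuning (the figures below are illustrative, not taken from the test machine): with shared_buffers around 24GB and the default 2MB huge page size, one would reserve a bit more than 24GB / 2MB = 12288 pages and tell PostgreSQL to insist on them:
# reserve ~25GB of 2MB huge pages (persist via /etc/sysctl.conf to survive reboots)
sysctl -w vm.nr_hugepages=12800
# require huge pages for PostgreSQL shared memory (PostgreSQL 9.4+)
echo "huge_pages = on" >> /var/lib/pgsql/10/data/postgresql.auto.conf
pg_ctl -D /var/lib/pgsql/10/data -m fast restart
# verify that the pages were allocated and are being used
grep HugePages_ /proc/meminfo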

Comments

  1. Large shared buffers can be a problem under a write-intensive load; in a read-only, pre-warmed workload there is probably no such limit.

    Replies
    1. Yes, this is true. Still, in my experience it happened just once that it was necessary to reduce shared buffers on a write-intensive database to speed things up.

      For a typical OLTP workload, though, the focus is on making shared buffers around 75% of the RAM, as this yields better query and checkpoint performance, especially if the working data set fits into the provided cache.

    2. It depends: 1. if you have too big SB, then you have to keep work_mem low, which can be a problem; 2. the performance issues related to SB depend on the absolute size of SB (larger than 10-30GB, depending on usage). Bigger SB needs more precise tuning of bgwriter and checkpoints. I have seen a slowdown when we increased SB from 7GB to 10GB (although the slowdown was small); the server has 100GB RAM.

    3. I agree, proper tuning of memory, bgwriter and checkpoints, as well as the Linux kernel, is a must for large SB cases.
      And we do have production databases running with SB larger than 100GB.

