PostgreSQL Bloatbusters.

Or when the vacuum is not enough.

Anyone who works with PostgreSQL, whether a DBA or a developer, at some point encountered vacuum. All DBAs have configured vacuum and most developers have experienced decreased database performance and slow query performance due to the vacuum.
Though vacuum may seems as a black box to developers, most DBAs know how important vacuum and its configurations are. To those who are unfamiliar with vacuum, it is basically a garbage collector which cleans the database. Someone might ask - what does it clean my database from? Here, I would like to provide a brief overview, whilst you can find a detailed answer in the official PostgreSQL documentation.




Whilst handling data PostgreSQL provides individual snapshots of it to each client. Eventually, with the update and delete operations some data in the snapshots become obsolete, but still exist in the database and occupy space in tables’ and indexes’ datafiles. The result is datafile fragmentation and overall performance degradation. Here is where vacuum comes into play, its main goal is to clean this unneeded data (the dead rows) from tables and indexes, and to free up space for new data.

Of course, the world isn’t perfect and vacuum has its own particularities. For example, after vacuum has done its work, the space doesn’t actually becomes free, it’s only available for reuse within vacuumed table or index. Thus, it may seem that the size of the table or index has been reduced. In fact, after vacuuming the size is exactly the same as before the vacuum. This is something that many beginner Postgres users are confused by.

Another example are specific workloads where vacuum is powerless or is unable to do its work efficiently. Let’s take idle transactions, they defer cleanup of dead rows and lead to tables or indexes bloat. Therefore, idle transactions, and even regular long transactions harm database and must be avoided.

Another interesting example is the intensively updated tables with high number of indexes, in this case, when the vacuum processes such tables and their indexes, a high amount of updates trigger a new vacuum on that table, and we might see endless vacuum on that table. Even the aggressively configured vacuum wouldn’t be helpful in this case and the result - tables bloat.

You might also remember other similar cases with the same outcome - none of the vacuum configurations help and it seems impossible to return the occupied space. Of course, Postgres has VACUUM FULL feature, but it blocks access to the table, not only for writes but also for read SELECT queries. This is often unacceptable especially in production environment.

There is a way out, however, PostgreSQL is an open source and it has a great community. Community offers many tools for PostgreSQL and there are few tools which address VACUUM FULL limitations and do its work without blocking access to the processed table.

The first tool is pg_repack. It works like VACUUM FULL on steroids, it creates а new table and moves data from the old one to the new. To avoid the exclusive table locking, an additional log table is created to log changes made in the original table, and added a trigger that logs INSERT/UPDATE/DELETEs to the log table. When data from original tables has been imported into the new one and indexes are rebuilt, the changes from the log table are applied and finally, pg_repack swaps tables including indexes and drops the old, original table. This is quite simple and reliable process, but it has one caveat - extra space equals to an original table with its indexes plus some space for log table has to be reserved.

This is the main function of pg_repack, however, it has a few additional interesting features that allow you to:

  • process only indexes without parent table. 
  • move tables and indexes between tablespaces - it's helpful to balance IO or move archived partitions between fast SSD and slow HDD storages permanently for databases clients.
  • perform a non-blocking CLUSTER ordered by the specified columns.
  • speed up indexes build process with concurrent workers.
  • cancel queries if they block pg_repack.
So, pg_repack looks nice and as you can see pg_repack is speed oriented, so it should be remembered that its inaccurate usage might lead to extreme disk IO and drop client queries' performance.

Another community tool is pgcompacttable (also known as compactor) is another tool which helps to compress tables and indexes and achieve their optimal size.

Pgcompacttable works in different manner than pg_repack. It uses one interesting feature of PostgreSQL when it does INSERT or UPDATE operations - all new versions of rows go to available space at the beginning of the table. This is really important, because if starts updating all rows in the table from the end and moves backwards to the table beginning. Eventually all rows will fill all the available space and freed up space at the bottom of the table can be truncated by regular vacuum. Using special system columnctid it's possible to update rows and moving them from the end of table to the beginning. Thus pgcompacttable with batches of updates and vacuums forces PostgreSQL to move rows within tables and table finally becomes compacted.


Pgcompacttable has additional interesting feature that is worth mentioning, which somewhat intersects with pg_repack features:

  • ability to process only indexes without parent table.
  • ability to adjust performance impact caused by pgcompacttable
To conclude, while pg_repack may cause a temporary drop in the database performance, pgcompacttable is much safer option - it doesn't require reserved space for table maintenance (but still requires reserved space for index rebuild) and allows to compact table when it’s important to preserve clients’ queries performance.

What to choose? Here is a short checklist that might be handy for choosing your tool:
1. Does the database has performance reserves? Is it important to keep clients queries performance?
Yes - pgcompacttable, No - pg_repack.
2. Do I need to move tables or indexes between tablespaces?
Yes - pg_repack, No - pgcompacttable.
3. Do I have a free reserved space for maintenance?
Yes - pg_repack, No - pgcompacttable.

In any case, it is always a good idea to keep both tools on your server.

Comments

  1. Hi,

    Hwat's the difference between pgcompacttable and pgcompact from https://github.com/grayhemp/pgtoolkit ? How are they related ?

    Thanks,

    ReplyDelete
    Replies
    1. They both started from my perl script which I wrote for this task many 5 years ago based on idea found on https://www.depesz.com/2013/06/21/bloat-removal-by-tuples-moving/ .

      They using almost the same or the completely the same storable procedure for doing their job.

      So it's basically two independent fork of the same idea/initial code developed by different people. As a result of it - both utilities have slightly different feature and bug sets, but in simple cases they have almost same functionality.

      Delete
  2. I've been working with another relatively new non-blocking vacuuming tool, called pg_squeeze. It is much more integrated into PostgreSQL than the others. It uses background workers and logical decoding instead of triggers. It also does not require a cron to set and start jobs. It is integrated into a squeeze table. I like this best, next pgcompacttable and lastly, pg_repack. pg_repack is not getting prompt response on issues anymore. I created an issue on Jan 28 (issue 167), and no feedback from the maintainers yet!

    ReplyDelete
    Replies
    1. Hi, pg_squeeze looks interesting, going to check it out, thanks!

      Delete
  3. Can some one share the perf impacts of using these tools like pgcompacttable, pgcompac, pg_repack or pgsqueze. Also , is this tools supported for AWS Postgres RDS

    ReplyDelete
    Replies
    1. I'va not seen similar benchmarks, but it can be a topic for another post.
      About AWS RDS, I suppose only pgcompacttable is able to work with RDS, because pg_repack and pg_squeeze require to install their extensions into the database and RDS supports only a limited list of extensions.

      Delete
    2. pg_repack is supported in Amazon RDS in latest releases, AFAIK not without problems - see http://engineering.ometria.com/2018/01/29/using-pg_repack-with-aws-postgres-rds/. And since Amazon doesn't give superuser access to DB pgcompacttable probably will refuse to work because it need to change session_replication_role to ensure that triggers won't be fired during updates.

      Delete

Post a Comment

Popular Posts