Introduction to pg_stat_activity.

PostgreSQL RDBMS has a very informative built-in tools to track the state of the database. One such tool is pg_stat_activity. This is a system view that allows to monitor the databases processes in real time. This view is comparable to the system command top, it is one of the first places from which database administrator can start an investigation if some problem occurs. In this post I explain some useful examples how this view can be used to detect abnormal activity. To start, we need just some postgresql client, psql for example.

As a first step you can use an easy query:
# SELECT * FROM pg_stat_activity;

It is not important which database you are connected to, pg_stat_activity is shared among databases.
pg_stat_activity shows single line for the each database connection. There is a dedicated UNIX process for each connection. For one connection there are several attributes:
pid - process ID which handles the connection. In the PostgreSQL terminology this process is called backend or worker.
datid, datname - ID and name of the database to which client connects.
usesysid, usename - ID and name which was used to connect.
client_addr, client_port, client_hostname - network settings of the client, network address, port and hostname.
application_name - an arbitrary name that can be specified when client connects or a session varialble.
backend_start, xact_start, query_start, state_change - timestamps, which indicate when the process (backend), transaction or query within a transaction has been started, and the last time the state of the process was changed.
state, waiting - state of the process, and a flag which indicates if the process is waiting for another process.
backend_xid, backend_xmin - running transaction ID (xid) and transaction ID (xmin) which determines visibility scope for currently running transaction.
query - text of the query which currently runs by a worker or was recently executed in this connection.

Since each line describe a single connection, we can easily find out how many client connections established to the database.
# SELECT count(*) AS total_conns FROM pg_stat_activity;
total_conns
-------------
61


Next, we can use the WHERE clause and add filter options. For example, we can see how many connections established from the host with address 10.0.20.26.
# SELECT count(*) FROM pg_stat_activity WHERE client_addr = '10.0.20.26';
count
-------
10


Of course, we can use different conditions and filter other fields, including and combining them with AND and OR.
Very important attributes are the transaction and query start time. Using the current time, we can calculate the duration of a transactions and queries. This is very useful for the long transaction detection.
# SELECT
client_addr, usename, datname,
clock_timestamp() - xact_start AS xact_age,
clock_timestamp() - query_start AS query_age,
state, query
FROM pg_stat_activity
ORDER BY coalesce(xact_start, query_start);

-[ RECORD 1 ]-------------------------------------------------------
client_addr | 10.0.20.26
usename     | james
datname     | sales_db
xact_age    |
query_age   | 12 days 05:52:09.181345
state       | idle
query       | <query text>


As we can see in the output, we have a request which runs for 12 days. This is not an active transaction because xact_age field is empty. Using state field we can figure out the connection state - currently it's idle connection. Most likely James executed some query at Friday, not disconnected from the database and went on vacation for two weeks.
As mentioned above, in this example, we use a state field. Let us examine this in more details. The state field determines the current status of the connection and can be in one of several states:
active - this operating state of the process means that the process executes the query, that is, doing a useful work.
idle - is idling, no useful work is being done.
idle in transaction - is idling in an open transaction. This means that the application opened the transaction and does nothing anymore. Long running transactions (more than one hour) with such status are harmful to the database and should be forced to close, and the causes of such behavior at the application level should be removed.
idle in transaction (aborted) - is a broken transaction,at least one of the queries inside the transaction was failed, and other queries will be ignored, until transaction aborts.
fastpath function call - the backend is executing a fast-path function.
disabled - this is a dummy state, it is shown only if the track_activities option is disabled.
Now we have an idea of ​​the connection statuses, we can see the total activity in the database.
# SELECT
client_addr, usename, datname, state, count(*)
FROM pg_stat_activity
GROUP BY 1, 2, 3, 4 ORDER BY 5 DESC;
client_addr  | usename  | datname  |        state        | count
-------------+----------+----------+---------------------+-------
127.0.0.1    | app_user | sales_db | idle                | 28
127.0.0.1    | app_user | sales_db | active              | 15
127.0.0.1    | app_user | sales_db | idle in transaction | 3
127.0.0.1    | bg_user  | sales_db | active              | 6
10.11.2.12   | james    | sales_db | idle                | 2
10.0.20.26   | helen    | shop_db  | active              | 1


Note, in example above the most of the connections are idle. If the total number of idle connections is dozens or several hundreds, then you definitely need to start thinking towards using pgbouncer, to reduce the number of idle processes. We also can see idle in transaction processes, it is important to monitor them and close them using pg_terminate_backend() if they hang too long. To determine the age of such connections is not a big problem, because we already know how to do this.
# SELECT
client_addr, usename, datname,
clock_timestamp() - xact_start AS xact_age,
clock_timestamp() - query_start AS query_age,
state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY coalesce(xact_start,query_start);

client_addr | usename  | datname  |    xact_age     |    query_age    |        state        |   query
------------+----------+----------+-----------------+-----------------+---------------------+-------------
127.0.0.1   | app_user | sales_db | 00:00:06.001629 | 00:00:00.002542 | idle in transaction | <query text>
127.0.0.1   | app_user | sales_db | 00:00:05.006710 | 00:00:00.003561 | idle in transaction | <query text>
127.0.0.1   | app_user | sales_db | 00:00:00.009004 | 00:00:00.001629 | idle in transaction | <query text>


As we can see the age of the transaction is relatively small, so we can filter the short queries. Also we filter idle connections, because they are not so interesting. Let's filter out everything which is faster than 10 seconds.
# SELECT
client_addr, usename, datname,
now() - xact_start AS xact_age,
now() - query_start AS query_age,
state, query
FROM pg_stat_activity
WHERE (
(now() - xact_start) > '00:00:10'::interval OR
(now() - query_start) > '00:00:10'::interval AND
state <> 'idle'
)
ORDER BY coalesce(xact_start, query_start);
client_addr | usename  | datname  |    xact_age     |    query_age    |        state        |   query
------------+----------+----------+-----------------+-----------------+---------------------+-------------
127.0.0.1   | app_user | sales_db | 00:00:12.013319 | 00:00:05.002151 | active              | <query text>
127.0.0.1   | app_user | sales_db | 00:00:10.083718 | 00:00:10.083718 | idle in transaction | <query text>


But that's not the end, there can be situations when a query is blocked by another request or transaction. To identify such connections the waiting attribute can be used (true if disabled, false if no locks).
# SELECT
client_addr, usename, datname,
now() - xact_start AS xact_age,
now() - query_start AS query_age,
state, waiting, query
FROM pg_stat_activity
WHERE waiting
ORDER BY coalesce(xact_start, query_start);
client_addr | usename  | datname  |    xact_age     |    query_age    | state  | waiting |   query
------------+----------+----------+-----------------+-----------------+--------+---------+-------------
127.0.0.1   | app_user | sales_db | 00:00:16.736127 | 00:00:02.839100 | active | t       | <query text>


Presence of waiting processes is a bad sign, it is usually an evidence of poor designed application. Locking occurs in a situation where two or more concurrent transactions (or queries) try to access the same resource, such as a set of rows of a table. A simple example, transaction A update set of rows M, transaction B when tries to update the same set of rows and will wait until the transaction A will be either commited or aborted.
Besides pg_stat_activity, such situations can be tracked down in the postgresql log, in case when log_lock_waits is enabled.

[UPDATE waiting] LOG: process 29054 still waiting for ShareLock on transaction 2034635 after 1000.160 ms
[UPDATE waiting] DETAIL: Process holding the lock: 29030. Wait queue: 29054.
[UPDATE waiting] CONTEXT: while updating tuple (0,68) in relation "products"
[UPDATE waiting] STATEMENT: update products set price = 20 where id = 1;
[UPDATE waiting] LOG: process 29054 acquired ShareLock on transaction 2034635 after 9049.000 ms
[UPDATE waiting] CONTEXT: while updating tuple (0,68) in relation "products"
[UPDATE waiting] STATEMENT: update products set price = 20 where id = 1;


What's going on here?
  1. The process with PID 29054 is blocked and is waiting. Process with PID 29030 holds the lock.
  2. The text of the waiting query was logged too.
  3. After the 9 seconds through the process with PID 29054 obtain resources and was able to complete the query.
A deadlock situation is also possible. That happens when to commit some transaction PostgreSQL needs to acquire a lock on the resource, currently blocked by another transaction, which can release the lock only after obtaining another lock on the resource currently blocked by the first transaction. When this situation occurs, postgres deadlock detection mechanism terminates one of the transactions and that allows another transactions to proceed. Default deadlock timeout is 1 second and can be configured through deadlock_timeout. Such situations are also logged in the postgresql log.

[UPDATE] ERROR: deadlock detected
[UPDATE] DETAIL: Process 29054 waits for ShareLock on transaction 2034637; blocked by process 29030.
Process 29030 waits for ShareLock on transaction 2034638; blocked by process 29054.
Process 29054: update products set price = 20 where id = 1;
Process 29030: update products set price = 20 where id = 2;
[UPDATE] HINT: See server log for query details.
[UPDATE] CONTEXT: while updating tuple (0,68) in relation "products"
[UPDATE] STATEMENT: update products set price = 20 where id = 1;
[UPDATE waiting] LOG: process 29030 acquired ShareLock on transaction 2034638 after 2924.914 ms
[UPDATE waiting] CONTEXT: while updating tuple (0,69) in relation "products"
[UPDATE waiting] STATEMENT: update products set price = 20 where id = 2;


what was happened here?
  1. The process with PID 29054 waits, as it was blocked by a process with PID 29030.
  2. In turn, the process with PID 29030 waits too, as it was blocked by a process with PID 29054.
  3. We can see further queries because of which deadlock was occured.
  4. The process with PID 29030 was unlocked, so query within the process 29054 has been forced to cancel and transaction moved to "idle in transaction (aborted)" state.
Anyway, the waiting processes and deadlocks are bad, such incidents must be investigated and must be eliminated. Maybe someone will ask: Why they are so bad? I will answer, while the process is blocked, the application which sent the query also waits, in this case the end user may think that the system slows down and can be frustrated. Nobody likes long response time.
To wrap it up, now you know how to use pg_stat_activity effectively, and can create your own view. It will be the first mate when something in the database goes out of control. Here is an example of a view which shows abnormal activity in the database.

# CREATE VIEW db_activity AS
SELECT
pid, client_addr, client_port,
datname, usename,
clock_timestamp() - pg_stat_activity.xact_start AS ts_age,
clock_timestamp() - pg_stat_activity.query_start AS query_age,
clock_timestamp() - pg_stat_activity.state_change AS change_age,
waiting, state, query
FROM pg_stat_activity
WHERE (
(clock_timestamp() - xact_start) > '00:00:00.1'::interval OR
(clock_timestamp() - query_start) > '00:00:00.1'::interval
AND state IN ('idle in transaction (aborted)', 'idle in transaction')
) AND pid <> pg_backend_pid()
ORDER BY COALESCE (xact_start, query_start);

That is all. Thanks!

Comments

  1. I think you probably want to change "transacaction" to "transaction" in your db_activity view.

    ReplyDelete

Post a Comment

Popular Posts