Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag) #1007

Open
spootze opened this issue Feb 26, 2024 · 2 comments
Open

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag) #1007

spootze opened this issue Feb 26, 2024 · 2 comments

Comments

@spootze
Copy link

spootze commented Feb 26, 2024

Add support for pg_stat_replication.(write_lag|flush_lag|replay_lag)

Use case. Why is this important?

For simple primary:replica setups it would be convenient to be able to monitor standby lag in seconds from the primary. pg_stat_replication already includes this information on postgres>=10, but the exporter does not parse it. AFAIK, workarounds would include

  • monitoring byte offset from pg_stat_replication.write_lsn|flush_lsn|replay_lsn), but this does not quite capture issues where the replica has not replayed a business critical change however small in bytes.
  • monitoring delay on the replica, but this would require spinning up a separate monitor only to monitor a single value on the replica.

Notes

I'm assuming the exporter does not support parsing the interval data type, which is why these metrics are marked as DISCARD here. I wonder if the following, explicit approach would be appropriate in this case:

diff --git a/cmd/postgres_exporter/queries.go b/cmd/postgres_exporter/queries.go
index fa0b5c2..e28d7b4 100644
--- a/cmd/postgres_exporter/queries.go
+++ b/cmd/postgres_exporter/queries.go
@@ -53,7 +53,10 @@ var queryOverrides = map[string][]OverrideQuery{
 			SELECT *,
 				(case pg_is_in_recovery() when 't' then null else pg_current_wal_lsn() end) AS pg_current_wal_lsn,
 				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), pg_lsn('0/0'))::float end) AS pg_current_wal_lsn_bytes,
-				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff
+				(case pg_is_in_recovery() when 't' then null else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff,
+				(case pg_is_in_recovery() when 't' then null else  extract(seconds from write_lag) end) as write_lag_seconds,
+				(case pg_is_in_recovery() when 't' then null else  extract(seconds from flush_lag) end) as flush_lag_seconds,
+				(case pg_is_in_recovery() when 't' then null else  extract(seconds from replay_lag) end) as replay_lag_seconds
 			FROM pg_stat_replication
 			`,
 		},

That way the deviation from the official fields is not in conflict with the source data.

@CarpathianUA
Copy link

Also interested in this, it would be great to have a replay_lag

@longtomjr
Copy link

Just a note for anyone looking into this. epoch should be extracted from the interval instead of seconds, since extracting seconds will only extract the seconds "component" of the interval:

				(case pg_is_in_recovery() when 't' then null else  extract(epoch from write_lag) end) as write_lag_seconds,
				(case pg_is_in_recovery() when 't' then null else  extract(epoch from flush_lag) end) as flush_lag_seconds,
				(case pg_is_in_recovery() when 't' then null else  extract(epoch from replay_lag) end) as replay_lag_seconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants