
[Bug]: inconsistent bad query plan #6900

Open
phemmer opened this issue May 8, 2024 · 4 comments

Comments

@phemmer

phemmer commented May 8, 2024

What type of bug is this?

Performance issue

What subsystems and features are affected?

Query planner

What happened?

I've been experiencing a lot of queries lately that inconsistently perform extremely badly: they vary between sub-second and 5+ minutes. I don't know whether they all share the same cause, but I dug into one of them, which I'll use for this report.

I've tested a few slight variations of the query and have managed to produce some seemingly insightful results. This particular query's issue seems to revolve around the time filter. If I use an absolute time range, the query always performs fine. If I use a large relative time range (large meaning 1h, relative meaning using now()), the query always performs fine. If I use a small relative time range (small meaning 5m), the query is inconsistent (sometimes <1s, sometimes >5m).

Now, I did run these on a live database with data constantly changing; however, after dozens of tries, the above held true every time. And the short relative query can flip between good and bad from one second to the next. Sometimes it runs well for only a few seconds before going bad again, and sometimes it runs well for many minutes.

So here are the queries (query + EXPLAIN ANALYZE output):

5m absolute time range

https://explain.dalibo.com/plan/677a6g12ec858cae#raw

1h relative time range

https://explain.dalibo.com/plan/3bafe032697843cc#raw

5m relative time range - bad plan

https://explain.dalibo.com/plan/c19g3a8fc569f4f1#raw

5m relative time range - good plan

https://explain.dalibo.com/plan/bgg3h91h6e9d6c26#raw

In the query examples, you can see that the good plans are all exactly the same, and the bad plan is substantially different. And even though the good plans involve parallel workers, that alone shouldn't explain the difference: the 5m relative good plan used 7 workers, so running without parallelization should take at most ~7x longer, but the bad plan is instead 313x slower.

TimescaleDB version affected

2.14.2

PostgreSQL version used

16.2

What operating system did you use?

Debian 10

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

Unfortunately I do not know how to reproduce it in a clean environment. I'll provide the basic setup info below, but please feel free to request any additional information and I'll provide it.

First haproxy_server is a view defined as:

create view haproxy_server
            (time, tag_id, mode, pxname, svname, aggregate, host, instance, act, addr, agent_code, agent_desc, agent_duration, agent_fall, agent_health, agent_rise, agent_status, bck, bin, bout, check_desc, check_duration, check_fall,
             check_health, check_rise, check_status, chkdown, chkfail, cli_abrt, connect, cookie, ctime, ctime_max, downtime, dresp, econn, eint, eresp, hrsp_1xx, hrsp_2xx, hrsp_3xx, hrsp_4xx, hrsp_5xx, hrsp_other, idle_conn_cur, iid,
             last_agt, last_chk, lastchg, lastsess, lbtot, need_conn_est, qcur, qlimit, qmax, qtime, qtime_max, rate, rate_max, reuse, rtime, rtime_max, safe_conn_cur, scur, sid, slim, smax, src_ilim, srv_abrt, srv_icur, status, stot,
             tracked, ttime, ttime_max, used_conn_cur, weight, wredis, wretr, wrew)
as
SELECT t."time",
       t.tag_id,
       tt.mode,
       tt.pxname,
       tt.svname,
       tt.aggregate,
       tt.host,
       tt.instance,
       t.act,
       t.addr,
       t.agent_code,
       t.agent_desc,
       t.agent_duration,
       t.agent_fall,
       t.agent_health,
       t.agent_rise,
       t.agent_status,
       t.bck,
       t.bin,
       t.bout,
       t.check_desc,
       t.check_duration,
       t.check_fall,
       t.check_health,
       t.check_rise,
       t.check_status,
       t.chkdown,
       t.chkfail,
       t.cli_abrt,
       t.connect,
       t.cookie,
       t.ctime,
       t.ctime_max,
       t.downtime,
       t.dresp,
       t.econn,
       t.eint,
       t.eresp,
       t.hrsp_1xx,
       t.hrsp_2xx,
       t.hrsp_3xx,
       t.hrsp_4xx,
       t.hrsp_5xx,
       t.hrsp_other,
       t.idle_conn_cur,
       t.iid,
       t.last_agt,
       t.last_chk,
       t.lastchg,
       t.lastsess,
       t.lbtot,
       t.need_conn_est,
       t.qcur,
       t.qlimit,
       t.qmax,
       t.qtime,
       t.qtime_max,
       t.rate,
       t.rate_max,
       t.reuse,
       t.rtime,
       t.rtime_max,
       t.safe_conn_cur,
       t.scur,
       t.sid,
       t.slim,
       t.smax,
       t.src_ilim,
       t.srv_abrt,
       t.srv_icur,
       t.status,
       t.stot,
       t.tracked,
       t.ttime,
       t.ttime_max,
       t.used_conn_cur,
       t.weight,
       t.wredis,
       t.wretr,
       t.wrew
FROM telegraf.haproxy_server t,
     telegraf.haproxy_server_tag tt
WHERE t.tag_id = tt.tag_id;

telegraf.haproxy_server_tag is:

         Table "telegraf.haproxy_server_tag"
  Column   |  Type  | Collation | Nullable | Default 
-----------+--------+-----------+----------+---------
 tag_id    | bigint |           | not null | 
 aggregate | text   |           |          | 
 host      | text   |           |          | 
 instance  | text   |           |          | 
 mode      | text   |           |          | 
 pxname    | text   |           |          | 
 svname    | text   |           |          | 
Indexes:
    "haproxy_server_tag_pkey" PRIMARY KEY, btree (tag_id)

telegraf.haproxy_server is:

                        Table "telegraf.haproxy_server"
     Column     |            Type             | Collation | Nullable | Default 
----------------+-----------------------------+-----------+----------+---------
 time           | timestamp without time zone |           | not null | 
 tag_id         | bigint                      |           |          | 
 act            | bigint                      |           |          | 
 addr           | text                        |           |          | 
 agent_code     | bigint                      |           |          | 
 agent_duration | bigint                      |           |          | 
 agent_fall     | bigint                      |           |          | 
 agent_health   | bigint                      |           |          | 
 agent_rise     | bigint                      |           |          | 
 bck            | bigint                      |           |          | 
 bin            | bigint                      |           |          | 
 bout           | bigint                      |           |          | 
 check_desc     | text                        |           |          | 
 check_duration | bigint                      |           |          | 
 check_fall     | bigint                      |           |          | 
 check_health   | bigint                      |           |          | 
 check_rise     | bigint                      |           |          | 
 check_status   | text                        |           |          | 
 chkdown        | bigint                      |           |          | 
 chkfail        | bigint                      |           |          | 
 cli_abrt       | bigint                      |           |          | 
 connect        | bigint                      |           |          | 
 ctime          | bigint                      |           |          | 
 ctime_max      | bigint                      |           |          | 
 downtime       | bigint                      |           |          | 
 dresp          | bigint                      |           |          | 
 econn          | bigint                      |           |          | 
 eint           | bigint                      |           |          | 
 eresp          | bigint                      |           |          | 
 hrsp_1xx       | bigint                      |           |          | 
 hrsp_2xx       | bigint                      |           |          | 
 hrsp_3xx       | bigint                      |           |          | 
 hrsp_4xx       | bigint                      |           |          | 
 hrsp_5xx       | bigint                      |           |          | 
 hrsp_other     | bigint                      |           |          | 
 idle_conn_cur  | bigint                      |           |          | 
 iid            | bigint                      |           |          | 
 lastchg        | bigint                      |           |          | 
 lastsess       | bigint                      |           |          | 
 lbtot          | bigint                      |           |          | 
 need_conn_est  | bigint                      |           |          | 
 qcur           | bigint                      |           |          | 
 qlimit         | bigint                      |           |          | 
 qmax           | bigint                      |           |          | 
 qtime          | bigint                      |           |          | 
 qtime_max      | bigint                      |           |          | 
 rate           | bigint                      |           |          | 
 rate_max       | bigint                      |           |          | 
 reuse          | bigint                      |           |          | 
 rtime          | bigint                      |           |          | 
 rtime_max      | bigint                      |           |          | 
 safe_conn_cur  | bigint                      |           |          | 
 scur           | bigint                      |           |          | 
 sid            | bigint                      |           |          | 
 smax           | bigint                      |           |          | 
 src_ilim       | bigint                      |           |          | 
 srv_abrt       | bigint                      |           |          | 
 srv_icur       | bigint                      |           |          | 
 status         | text                        |           |          | 
 stot           | bigint                      |           |          | 
 ttime          | bigint                      |           |          | 
 ttime_max      | bigint                      |           |          | 
 used_conn_cur  | bigint                      |           |          | 
 weight         | bigint                      |           |          | 
 wredis         | bigint                      |           |          | 
 wretr          | bigint                      |           |          | 
 wrew           | bigint                      |           |          | 
 agent_desc     | text                        |           |          | 
 agent_status   | text                        |           |          | 
 cookie         | text                        |           |          | 
 last_agt       | text                        |           |          | 
 last_chk       | text                        |           |          | 
 tracked        | text                        |           |          | 
 slim           | bigint                      |           |          | 
Indexes:
    "haproxy_server_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON telegraf.haproxy_server FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()
Number of child tables: 181 (Use \d+ to list them.)


# select * from timescaledb_information.hypertables where hypertable_name = 'haproxy_server';
 hypertable_schema | hypertable_name |   owner   | num_dimensions | num_chunks | compression_enabled | tablespaces 
-------------------+-----------------+-----------+----------------+------------+---------------------+-------------
 telegraf          | haproxy_server  | edgestats |              1 |        181 | t                   | 

# select * from timescaledb_information.compression_settings where hypertable_name = 'haproxy_server';
 hypertable_schema | hypertable_name | attname | segmentby_column_index | orderby_column_index | orderby_asc | orderby_nullsfirst 
-------------------+-----------------+---------+------------------------+----------------------+-------------+--------------------
 telegraf          | haproxy_server  | tag_id  |                      1 |                      |             | 
 telegraf          | haproxy_server  | time    |                        |                    1 | f           | t


# select * from timescaledb_information.dimensions where hypertable_name = 'haproxy_server';
 hypertable_schema | hypertable_name | dimension_number | column_name |         column_type         | dimension_type | time_interval | integer_interval | integer_now_func | num_partitions 
-------------------+-----------------+------------------+-------------+-----------------------------+----------------+---------------+------------------+------------------+----------------
 telegraf          | haproxy_server  |                1 | time        | timestamp without time zone | Time           | 02:00:00      |                  |                  |               

@phemmer phemmer added the bug label May 8, 2024
@svenklemm
Member

The difference seems to be Nested Loop Join vs. Hash Join.
Did you intentionally choose to use timestamp without time zone here? This introduces casts into the query, which prevent chunk exclusion from working at plan time; it can only happen at execution time, so the number of chunks seen during planning is very different from execution. Switching to timestamp with time zone would probably solve your problem. Another workaround would be to disable nested loop joins for this query with set enable_nestloop TO false;

@phemmer
Author

phemmer commented May 9, 2024

Did you intentionally choose to use timestamp without timezone here?

Yes. To eliminate variables, I wanted the data type to be consistent between the relative and absolute queries. In addition, the underlying data type on the table is also timestamp without time zone.

This introduces casts in this query

Why would this be the case? If both the query and the table are using the same data type, why would there be casting involved?

Switching to timestamp with timezone would probably solve your problem.

It did not. https://explain.dalibo.com/plan/949a6301d7ch487e#raw

Another workaround would be to disable nested loop join

That appears to have done it (although now it's deciding to play nice for a while even without that). And Grafana (the main source of the queries) does allow transactions via explicit BEGIN; END;, so SET LOCAL can be used to apply the setting on an as-needed basis. But yes, it's just a workaround, and I don't want to have to rely on it.
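For reference, scoping the workaround to a single transaction keeps it from leaking into other queries on the same connection; the setting reverts automatically at COMMIT or ROLLBACK. A sketch (the SELECT is a placeholder for the actual slow query):

```sql
BEGIN;
-- Only affects statements inside this transaction.
SET LOCAL enable_nestloop TO false;
SELECT ...;  -- the slow query goes here
COMMIT;
```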

@svenklemm
Member

Did you intentionally choose to use timestamp without timezone here?

Yes. To eliminate variables, I wanted the data type to be consistent between the relative and absolute queries. In addition the underlying data type on the table is also timestamp without timezone.

Usage of timestamp without time zone is generally discouraged in Postgres. See https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29_to_store_UTC_times

This introduces casts in this query

Why would this be the case? If both the query and the table are using the same data type, why would there be casting involved?

now() returns timestamp with time zone, so any constraint involving now() will involve data type conversions.
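To illustrate the point (a sketch; the predicate is representative, not the exact query from the report): with a timestamp without time zone column, a relative bound built from now() needs a timestamptz-to-timestamp conversion, which depends on the session TimeZone setting. Since now() is only STABLE, not IMMUTABLE, the planner cannot fold the bound to a plan-time constant, and chunk exclusion is deferred to executor startup.

```sql
-- "time" is timestamp WITHOUT time zone; now() is timestamptz.
SELECT count(*)
FROM telegraf.haproxy_server
WHERE "time" > now() - interval '5 minutes';
-- The bound is effectively (now() - interval '5 minutes')::timestamp,
-- which is not a plan-time constant, so chunks can only be
-- excluded at execution time.
```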

Switching to timestamp with timezone would probably solve your problem.

It did not. https://explain.dalibo.com/plan/949a6301d7ch487e#raw

Did you change the table definition as well? With the data type being timestamp with time zone, I would expect chunk exclusion to move from execution time to planning time, which should be a big improvement to planning time as well.

The line "Chunks excluded during startup: 170" should disappear, as that is the exclusion happening during executor startup.
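One hedged way to check where exclusion happens (table and predicates are illustrative): run plain EXPLAIN first. Chunks excluded at plan time simply never appear in the plan, while runtime exclusion shows up only under EXPLAIN ANALYZE as the "Chunks excluded during startup" line on the ChunkAppend node.

```sql
-- Plan-time exclusion: with a constant bound, EXPLAIN (no ANALYZE)
-- should already list only the matching chunks.
EXPLAIN
SELECT count(*) FROM telegraf.haproxy_server
WHERE "time" > timestamp '2024-05-08 12:00:00';

-- Runtime exclusion: with a STABLE expression like now(), EXPLAIN
-- lists all chunks, and EXPLAIN ANALYZE instead reports
-- "Chunks excluded during startup: N".
EXPLAIN ANALYZE
SELECT count(*) FROM telegraf.haproxy_server
WHERE "time" > now()::timestamp - interval '5 minutes';
```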

@phemmer
Author

phemmer commented May 18, 2024

now() returns timestamp with timezone so any constraint with now() will involve datatype conversions

That's why I used now()::timestamp, so that it would convert once. now() is declared STABLE (in Postgres nomenclature), so it always returns the same result within a given statement; thus it should only be converted once.

Did you change the table definition as well. With the datatype being timestamp with timezone i would expect to move the chunk exclusion from execution time to planning time which should be a big improvement to planning time as well.

No, that's a massive undertaking, as it involves rewriting a ton of data. However, for the purposes of this issue I've created a new table using timestamptz, though I now have to wait for it to fill back up.
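A sketch of how such a parallel table could be set up (the table name is hypothetical; the 2-hour chunk interval mirrors the existing hypertable's dimension, and only a few of the ~70 metric columns are shown):

```sql
-- Hypothetical timestamptz replacement for telegraf.haproxy_server.
CREATE TABLE telegraf.haproxy_server_tz (
    "time"  timestamptz NOT NULL,
    tag_id  bigint,
    scur    bigint,
    stot    bigint
    -- remaining columns as in telegraf.haproxy_server
);

-- Convert it to a hypertable with the same chunk interval.
SELECT create_hypertable('telegraf.haproxy_server_tz', 'time',
                         chunk_time_interval => interval '2 hours');
```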
