Support system tables #24972

hiltontj · 2024-05-07T19:55:31Z

This is a parent issue for adding support for system tables to influxdb3.

This is to provide system level debug information via queries such as those available in distributed versions of influxdb3:

system.compactor (only on pro): contains info on total number of L_N files as well as their total size for each partition
system.queries: contains a log of queries performed, including various pieces of info and state indicators for each query
system.parquet_files (new): list parquet files for a given database and table

system.tables and system.partitions, which are included in distributed version of influxdb3, are not relevant in monolith mainly because there is no concept of partitions here.

It is worth noting that the SQL standard information_schema tables are already supported via DataFusion and can be accessed with queries like SHOW TABLES and SHOW COLUMNS FROM <table_name> in influxdb3 today.

Tasks

Give feedback

Move traits such as IoxSystemTable and SystemTableProvider and any other shareable code from iox into influxdb3_core (chore: latest core sync influxdb3_core#18)
Tie system tables queries into the QueryDatabase in the query_executor using core traits/types #24986

v3
Create the system.queries provider #24987

v3
Make system tables available via a debug option in the HTTP API and influxdb3 client #24989

v3
Create the system.parquet_files provider #24988

v3
Options

The text was updated successfully, but these errors were encountered:

pauldix · 2024-05-07T20:46:21Z

Partitions doesn't make sense to bring over as OSS doesn't use the same scheme. We'll want to have a system table for listing parquet files I think.

hiltontj · 2024-05-07T20:55:54Z

I was wondering about that since partitions are really a distributed thing. It looks like we do fabricate a partition ID, but I guess that is mostly to satisfy the QueryChunk trait?

Either way, as long as there is no compatibility issues, I suppose we could have a system.parquet_files provider in lieu of system.partitions and system.tables (which seems useless in absence of system.partitions)

pauldix · 2024-05-08T16:50:13Z

Yeah, the partition ID is just a stub to satisfy the trait.

For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.

alamb · 2024-05-08T17:37:50Z

For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.

Distributed doesn't enable system tables unless you set a special debug flag to avoid the potential issue of returning a massive amount of stuff.

One thing you could do is basically require a table_name = 'foo' and a databse_name = 'foo' predicate to be present

That would be implemented by

looking at the _filter Exprs passed to the TableProvider::scan https://github.com/influxdata/influxdb_iox/blob/ca887899c73b57b710b371da1ca79fafced1831b/querier/src/system_tables/mod.rs#L145-L150
Looking for those special predicates (special casing BinaryExpr)
Erroring if they can't be cound with a message like "you must have a table_name='foo' predicate to query parquet_files" or something

pauldix · 2024-05-08T17:44:53Z

@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency. Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.

alamb · 2024-05-08T19:11:25Z

@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency.

Makes sense to me

Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.

I think there are different opinions on this topic

I personally like system tables as then you have immediate filtering/joining/aggregation via SQL without any additional tools

However, I believe others prefer REST APIs so that it is easier to use other tools for analysis.

hiltontj · 2024-05-08T23:30:55Z

Having it in the query language would expose it via the /query API, which may not be as convenient as dedicated endpoints, but is still workable. We could then hook up dedicated REST endpoints if people desire them.

I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with. That is not the case with the current system tables in iox though, and I don't think would be for listing out parquet file info.

alamb · 2024-05-09T08:43:27Z

I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with.

That is a good point -- DataFusion / Rust can handle structured types (e.g StructArray, MapArray and ListArray) but the support is definitely not a full featured as it could be

hiltontj · 2024-05-09T13:06:10Z

To add some detail, the SHOW TABLES and SHOW COLUMNS queries, which are already supported, provide the following:

influxdb3 query -d foo 'SHOW TABLES'

+---------------+--------------------+-------------+------------+
| table_catalog | table_schema       | table_name  | table_type |
+---------------+--------------------+-------------+------------+
| public        | iox                | cpu         | BASE TABLE |
| public        | iox                | mem         | BASE TABLE |
| public        | information_schema | tables      | VIEW       |
| public        | information_schema | views       | VIEW       |
| public        | information_schema | columns     | VIEW       |
| public        | information_schema | df_settings | VIEW       |
| public        | information_schema | schemata    | VIEW       |
+---------------+--------------------+-------------+------------+

influxdb3 query -d foo 'SHOW COLUMNS FROM cpu'

+---------------+--------------+------------+-------------+-----------------------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type                   | is_nullable |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| public        | iox          | cpu        | host        | Dictionary(Int32, Utf8)     | YES         |
| public        | iox          | cpu        | time        | Timestamp(Nanosecond, None) | NO          |
| public        | iox          | cpu        | usage       | Float64                     | YES         |
+---------------+--------------+------------+-------------+-----------------------------+-------------+

hiltontj · 2024-05-09T13:29:51Z

DataFusion / Rust can handle structured types

Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃

I would prefer to make it available via the query language first, because as @alamb mentioned it gives users a fair bit of capability out of the box. I think it is also easy for us to extend, and easy for users to adopt extensions, as they just write different SQL queries.

alamb · 2024-05-09T14:02:50Z

Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃

❤️

hiltontj · 2024-05-16T13:28:53Z

There are PRs up to resolve all sub-issues above, I'll summarize them here, as it would be easier to review them in the following sequence:

feat: add SystemSchemaProvider to QueryExecutor #24990 - Provides the framework for providing system._ tables to the query executor, and elevates the QueryLog
feat: add the system.queries table #24992 - Adds support for the system.queries table, which provides access via SQL queries to the QueryLog on the query executor
refactor: expose system tables by default in edge/pro #25000 - makes it so queries to the system.* tables are available without the need for a debug header/parameter in gRPC/HTTP requests
feat: support system.parquet_files table #25002 - Adds support for the system.parquet_files table, which provides access via SQL queries to see parquet files associated with a given table/database

hiltontj added the v3 label May 7, 2024

hiltontj self-assigned this May 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support system tables #24972

Support system tables #24972

hiltontj commented May 7, 2024 •

edited

Tasks

pauldix commented May 7, 2024

hiltontj commented May 7, 2024

pauldix commented May 8, 2024

alamb commented May 8, 2024

pauldix commented May 8, 2024

alamb commented May 8, 2024 •

edited

hiltontj commented May 8, 2024

alamb commented May 9, 2024

hiltontj commented May 9, 2024

hiltontj commented May 9, 2024

alamb commented May 9, 2024

hiltontj commented May 16, 2024

Support system tables #24972

Support system tables #24972

Comments

hiltontj commented May 7, 2024 • edited

Tasks

pauldix commented May 7, 2024

hiltontj commented May 7, 2024

pauldix commented May 8, 2024

alamb commented May 8, 2024

pauldix commented May 8, 2024

alamb commented May 8, 2024 • edited

hiltontj commented May 8, 2024

alamb commented May 9, 2024

hiltontj commented May 9, 2024

hiltontj commented May 9, 2024

alamb commented May 9, 2024

hiltontj commented May 16, 2024

hiltontj commented May 7, 2024 •

edited

alamb commented May 8, 2024 •

edited