Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support system tables #24972

Open
4 of 5 tasks
hiltontj opened this issue May 7, 2024 · 12 comments
Open
4 of 5 tasks

Support system tables #24972

hiltontj opened this issue May 7, 2024 · 12 comments
Assignees
Labels

Comments

@hiltontj
Copy link
Contributor

hiltontj commented May 7, 2024

This is a parent issue for adding support for system tables to influxdb3.

This is to provide system level debug information via queries such as those available in distributed versions of influxdb3:

  • system.compactor (only on pro): contains info on total number of LN files as well as their total size for each partition
  • system.queries: contains a log of queries performed, including various pieces of info and state indicators for each query
  • system.parquet_files (new): list parquet files for a given database and table

system.tables and system.partitions, which are included in distributed version of influxdb3, are not relevant in monolith mainly because there is no concept of partitions here.

It is worth noting that the SQL standard information_schema tables are already supported via DataFusion and can be accessed with queries like SHOW TABLES and SHOW COLUMNS FROM <table_name> in influxdb3 today.

Tasks

  1. v3
    hiltontj
  2. v3
    hiltontj
  3. v3
    hiltontj
  4. v3
    hiltontj
@hiltontj hiltontj added the v3 label May 7, 2024
@pauldix
Copy link
Member

pauldix commented May 7, 2024

Partitions doesn't make sense to bring over as OSS doesn't use the same scheme. We'll want to have a system table for listing parquet files I think.

@hiltontj
Copy link
Contributor Author

hiltontj commented May 7, 2024

I was wondering about that since partitions are really a distributed thing. It looks like we do fabricate a partition ID, but I guess that is mostly to satisfy the QueryChunk trait?

Either way, as long as there is no compatibility issues, I suppose we could have a system.parquet_files provider in lieu of system.partitions and system.tables (which seems useless in absence of system.partitions)

@pauldix
Copy link
Member

pauldix commented May 8, 2024

Yeah, the partition ID is just a stub to satisfy the trait.

For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.

@alamb
Copy link
Contributor

alamb commented May 8, 2024

For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.

Distributed doesn't enable system tables unless you set a special debug flag to avoid the potential issue of returning a massive amount of stuff.

One thing you could do is basically require a table_name = 'foo' and a databse_name = 'foo' predicate to be present

That would be implemented by

  1. looking at the _filter Exprs passed to the TableProvider::scan https://github.com/influxdata/influxdb_iox/blob/ca887899c73b57b710b371da1ca79fafced1831b/querier/src/system_tables/mod.rs#L145-L150
  2. Looking for those special predicates (special casing BinaryExpr)
  3. Erroring if they can't be cound with a message like "you must have a table_name='foo' predicate to query parquet_files" or something

@pauldix
Copy link
Member

pauldix commented May 8, 2024

@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency. Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.

@alamb
Copy link
Contributor

alamb commented May 8, 2024

@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency.

Makes sense to me

Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.

I think there are different opinions on this topic

I personally like system tables as then you have immediate filtering/joining/aggregation via SQL without any additional tools

However, I believe others prefer REST APIs so that it is easier to use other tools for analysis.

@hiltontj
Copy link
Contributor Author

hiltontj commented May 8, 2024

Having it in the query language would expose it via the /query API, which may not be as convenient as dedicated endpoints, but is still workable. We could then hook up dedicated REST endpoints if people desire them.

I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with. That is not the case with the current system tables in iox though, and I don't think would be for listing out parquet file info.

@alamb
Copy link
Contributor

alamb commented May 9, 2024

I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with.

That is a good point -- DataFusion / Rust can handle structured types (e.g StructArray, MapArray and ListArray) but the support is definitely not a full featured as it could be

@hiltontj
Copy link
Contributor Author

hiltontj commented May 9, 2024

To add some detail, the SHOW TABLES and SHOW COLUMNS queries, which are already supported, provide the following:

influxdb3 query -d foo 'SHOW TABLES'
+---------------+--------------------+-------------+------------+
| table_catalog | table_schema       | table_name  | table_type |
+---------------+--------------------+-------------+------------+
| public        | iox                | cpu         | BASE TABLE |
| public        | iox                | mem         | BASE TABLE |
| public        | information_schema | tables      | VIEW       |
| public        | information_schema | views       | VIEW       |
| public        | information_schema | columns     | VIEW       |
| public        | information_schema | df_settings | VIEW       |
| public        | information_schema | schemata    | VIEW       |
+---------------+--------------------+-------------+------------+
influxdb3 query -d foo 'SHOW COLUMNS FROM cpu'
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type                   | is_nullable |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| public        | iox          | cpu        | host        | Dictionary(Int32, Utf8)     | YES         |
| public        | iox          | cpu        | time        | Timestamp(Nanosecond, None) | NO          |
| public        | iox          | cpu        | usage       | Float64                     | YES         |
+---------------+--------------+------------+-------------+-----------------------------+-------------+

@hiltontj
Copy link
Contributor Author

hiltontj commented May 9, 2024

DataFusion / Rust can handle structured types

Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃

I would prefer to make it available via the query language first, because as @alamb mentioned it gives users a fair bit of capability out of the box. I think it is also easy for us to extend, and easy for users to adopt extensions, as they just write different SQL queries.

@alamb
Copy link
Contributor

alamb commented May 9, 2024

Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃

❤️

@hiltontj hiltontj self-assigned this May 10, 2024
@hiltontj
Copy link
Contributor Author

There are PRs up to resolve all sub-issues above, I'll summarize them here, as it would be easier to review them in the following sequence:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants