Skip to content

Delimited Identifiers

Martin Traverso edited this page May 4, 2021 · 6 revisions

The intent of this document is to capture the high-level thoughts and ideas about how to add support for case sensitive identifiers.

Specification

<delimited identifier> ::=
  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::=  <delimited identifier part>...
<delimited identifier part> ::=
    <nondoublequote character>
  | <doublequote symbol>

<Unicode delimited identifier> ::=
  U <ampersand> <double quote> <Unicode delimiter body> <double quote>
      <Unicode escape specifier>
<Unicode escape specifier> ::=
  [ UESCAPE <quote> <Unicode escape character> <quote> ]
<Unicode delimiter body> ::=
  <Unicode identifier part>...
<Unicode identifier part> ::=
    <delimited identifier part>
  | <Unicode escape value>
24) For every <identifier body> IB there is exactly one corresponding case-normal form CNF. CNF is an <identifier body> derived from IB as follows:
Let n be the number of characters in IB. For i ranging from 1 (one) to n, the i-th character Mi of IB is transliterated into the corresponding character
or characters of CNF as follows:
Case:
   a) If Mi is a lower case character or a title case character for which an equivalent upper case sequence U is de ned by Unicode, then let j be th
       e number of characters in U; the next j characters of CNF are U.
   b) Otherwise, the next character of CNF is Mi.
25) The case-normal form of the <identifier body> of a <regular identifier> is used for purposes such as and including determination of identifier
      equivalence, representation in the Definition and Information Schemas, and representation in diagnostics areas.

...

27) Two <regular identifier>s are equivalent if the case-normal forms of their <identifier body>s, considered as the repetition of a <character string literal>
that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation IDC that is sensitive to case, compare equally
according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

28) A <regular identifier> and a <delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the
<delimited identifier body> of the <delimited identifier> (with all occurrences of <quote> replaced by <quote symbol> and all occurrences of
<doublequote symbol> replaced by <double quote>), considered as the repetition of a <character string literal> that specifies a <character set specification>
 of SQL_IDENTIFIER and IDC, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.


29) Two<delimited identifier>s are equivalent if their <delimited identifierbody>s,considered as the repetition of a <character string literal> that specifies
 a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the
 comparison rules in Subclause 8.2, “<comparison predicate>”.

30) Two <Unicode delimited identifier>s are equivalent if their <Unicode delimiter body>s, considered as the repetition of a <character string literal> that
 specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according
 to the comparison rules in Subclause 8.2, “<comparison predicate>”.

31) A <Unicode delimited identifier> and a <delimited identifier> are equivalent if their <Unicode delimiter body> and <delimited identifier body>,
respectively, each considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and
an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

32) A <regular identifier> and a <Unicode delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier>
and the <Unicode delimiter body> of the <Unicode delimited identifier> considered as the repetition of a <character string literal>, each specifying a
 <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the
comparison rules in Subclause 8.2, “<comparison predicate>”.

In a nutshell, when comparing identifiers, if any of the identifiers is non-delimited it needs to be canonicalized (upper case) and then compared using regular string comparison operations.

The book "SQL: 1999: Understanding Relational Language Components" by Jim Melton, one of the editors of the SQL spec says:

In effect, SQL:1999 changes all lowercase letters in regular identifiers to their uppercase-equivalent letters. This is especially important for identifiers that are stored as data values in the views of the Information Schema. [...] When a delimited identifier is stored into the views of the Information Schema, the double quotes are not stored, but all the other characters are, just as they appeared in the delimited identifier. Therefore, the regular identifier TITLES and the delimited identifier "TITLES" are stored identically and are therefore completely equivalent.

Goals

Trino currently does case-insensitive matching of identifiers. The goal of this project is to bring Trino closer to the SQL standard by doing SQL-compatible identifier matching wherever possible and to improve interoperability with storage engines that are not SQL-compliant. Ideally, the change should be made in such a way to minimize backward-compatibility breaking changes to the SPI.

Considerations

Connectors can be classified in one of these categories:

  • SQL compliant (e.g., Oracle, DB2)
    • Delimited identifiers are significant
    • Normalizes to upper case
  • Non-compliant
    • Case-sensitive (E.g., PostgreSQL)
      • Delimited identifiers are significant
      • Normalizes to something other than upper-case
    • Case-insensitive (e.g., Hive)
      • Delimited identifiers behave as un-delimited
      • May normalize identifiers

Trino interacts with identifiers in a few different ways:

  • Matching for the purpose of looking up entities (catalogs, schemas, tables, views, users, roles, session properties)
    • Columns are a special case, since they are resolved by the engine based on table metadata provided by connectors
  • Providing identifiers to connectors when creating entities
  • Presenting and processing entity names returned by connectors, such as in column resolution or any of the metadata introspection commands (SHOW, DESCRIBE, etc).

Approach

Proposal

New API in ConnectorMetadata:

/**
 * Canonicalizes the provided SQL identifier according to connector-specific rules
 * for the purpose of providing the name in metadata APIs
 */
default String canonicalize(String identifier, boolean delimited) {
    // default implementation normalizes according to standard SQL rules
    ...
}

The default implementation follows standard SQL rules for canonicalizing identifiers, but can be overridden by connectors that don't follow standard semantics. The engine calls this API to convert identifiers present in a SQL query into names that can be interpreted by connectors.

Note: we can consider making the default implementation normalize using current rules (i.e., identifier.toLowerCase(ENGLISH)) to preserve backward compatibility, but that comes at the expense of additional work for SQL-compatible connectors and supporting a behavior we don't want in the long term).

As an example, for the following query:

SELECT * FROM some_table

The syntax tree contains an Identifier("some_table", NON-DELIMITED) for the table reference. During analysis, the engine calls connectorMetadata.canonicalize("some_table", false) to obtain SOME_TABLE in the case of a SQL-compliant connector. This name is then used to fetch table metadata via connectorMetadata.getTableHandle().

Other principles:

  • All names returned by connectors (via INFORMATION_SCHEMA tables, table and column metadata, etc), are considered already canonicalized per to connector rules.
  • The approach does not require changing existing APIs. Any connector APIs that take names continue to use strings, but the values that are passed in are pre-canonicalize using the API. In a way, the canonicalization API is equivalent to modifying all the existing APIs to take identifier-like objects. The only difference is whether the connector has flexibility to pick different canonicalization rules for each API call

Use cases

  • Resolving a table name

    • SQL identifier is canonicalized via connector.canonicalize()
    • Canonicalized name is used to do a lookup against connector
  • Resolving a column name

    • Column name from SQL query is canonicalized via connector.canonicalize()
    • Lookup is performed against table metadata
    • Question: how do we evolve the analyzer to do this?
  • Resolving a ROW field

  • Resolving a user or role

  • Resolving a type

  • SHOW CREATE ...

  • INFORMATION_SCHEMA tables

(in progress)

  • How does normalization play into column and row field resolution? (tentatively: they are matched in the engine using standard SQL rules)
  • Do names returned from metadata APIs get normalized in any way (tentatively: no)

Other approaches

Previous (partial) ideas and why they are not appropriate or sufficient.

Normalize all identifiers within the engine using standard SQL rules

Connectors are not aware of the notion of delimited vs non-delimited identifiers. All they see is identifiers after they've been normalized. It can be trivially shown why this is problematic. Consider the following query against a table named some_table in PostgreSQL:

SELECT * FROM some_table (...)

Depending on whether the connector uses delimited identifiers in the synthesized query it sends to the PostgreSQL instance, a few behaviors are possible. If it delimits identifiers, the query will be:

-- PostgreSQL query
SELECT * FROM "SOME_TABLE"

Since the name of the table in the PostgreSQL database is some_table, the identifier won't match and the query will fail.

If it uses non-delimited identifiers, the query will be:

-- PostgreSQL query
SELECT * FROM SOME_TABLE

Since PostgreSQL normalizes names to lower-case, the SOME_TABLE identifier will match against the database table some_table. It would seem like the answer is for the connector to not use delimited identifiers against the backend PostgreSQL instance, but that would be a mistake, as it would become impossible to query any table containing mixed case characters.

This leaves us with requiring users to delimit identifiers when querying PostgreSQL tables, which is cumbersome and confusing:

SELECT * FROM "some_table" (...)

Introduce an Identifier-like metadata object to communicate whether a name is delimited or not in every object resolution API.

This is conceptually equivalent to the approach above, but it pushes any required normalization into every connector API. The change would have a large surface area and would require considerable work to adapt connectors to the new APIs.