Improve SQL parsing of character literals (quoted strings) #5108

erasmussen-first · 2023-10-23T20:32:16Z

Impact

Bug fix (non-breaking change which fixes expected existing functionality)
Enhancement/New feature (adds functionality without impacting existing logic)
Breaking change (fix or feature that would cause existing functionality to change)

Description

This fixes an SQL parsing issue where a certain character sequence inside a quoted character literal (aka quoted string) is incorrectly tokenized (premature end-of-string detection), leading to errors executing the SQL command(s) on the database host. The issue affects users of MySQL, Postgres, and possibly other DBs.

This PR also adds a SQL character-literal test pattern to SimpleSqlGrammarTest.groovy. The pattern fails without this update, so it confirms the fix and guards against future regressions.

Things to be aware of

Per highlightjs/highlight.js#1748:

Some SQL implementations support only quote-char-twice as an escaped quote, others support only backslash-quote, some support both, and sometimes it is configurable.
This means it is not always possible to conclusively identify quoted text boundaries without context about the SQL platform and its configuration.
A real fix might be to invoke a block of code defined in SimpleSqlGrammar.jj to dynamically check for false-positive matches on S_CHAR_LITERAL.

A related complication is that '\' is sometimes a valid string literal, and sometimes it is not.
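To make the ambiguity concrete, here is a minimal Java sketch. The two regular expressions are simplified stand-ins written for this illustration (they are not tokens from SimpleSqlGrammar.jj), and the class and method names are hypothetical. It shows the same SQL text yielding different literal boundaries depending on which escape convention is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteConventions {
    // Convention A (assumed: SQL standard): a quote is escaped only by doubling it;
    // a backslash is an ordinary character.
    static final Pattern DOUBLED_ONLY = Pattern.compile("'(?:''|[^'])*'");

    // Convention B (assumed: backslash escapes enabled, as in MySQL's default mode):
    // a backslash escapes the next character; doubling is accepted too.
    static final Pattern BACKSLASH_TOO =
            Pattern.compile("'(?:''|\\\\.|[^'\\\\])*'", Pattern.DOTALL);

    /** Returns the first terminated string literal under the given convention, or null. */
    static String firstLiteral(Pattern convention, String sql) {
        Matcher m = convention.matcher(sql);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The SQL text is: SELECT '\', 'x'
        String sql = "SELECT '\\', 'x'";
        // Under A the first literal is '\' (a one-character string).
        System.out.println(firstLiteral(DOUBLED_ONLY, sql));
        // Under B the backslash swallows the second quote, so the literal runs on to '\', '
        System.out.println(firstLiteral(BACKSLASH_TOO, sql));
        // Under B, '\' on its own never terminates.
        System.out.println(firstLiteral(BACKSLASH_TOO, "'\\'"));
    }
}
```

Neither answer is wrong in isolation; which one is right depends on the target database and its settings, which is the point made above.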

A similar issue in a different SQL parser using JavaCC is mentioned here: JSQLParser/JSqlParser#1172.

It was fixed, in theory, here: https://github.com/JSQLParser/JSqlParser/pull/1715/files#diff-d323df58a0300a038ac87b328bf05b8255ff06e6b5d0e9aeae641fa566e4068c
That fix may be overly specific to one SQL platform, but the core approach of adding a block of Java code (à la Lex) might be feasible.

Things to worry about

There are inconsistencies between different DB platforms in what is considered valid character literal syntax. Strings that are valid on one type of server may not be valid on another, and vice versa. This makes platform-agnostic end-of-string detection imprecise. Please test this very thoroughly on multiple DB platforms.

On the other hand, the current parser logic exposes MySQL, Postgres, and possibly other DBs to errors and/or unintended SQL command execution.

It may be that the truly correct fix involves detection of DB platform and configuration options.

Additional Context

The character pattern that this fixes was distilled out of changeSets that work correctly with MySQL 5.7 and Liquibase 3.5.4. I have not tested all the intervening releases to see exactly where it stopped working.

filipelautert · 2023-10-25T13:49:57Z

Hello @erasmussen-first! Thanks for the PR.
We have one failed integration test. It's a MySQL test, and it fails when running the following SQL:

CREATE PROCEDURE insert_shop ()
INSERT INTO shop 
VALUES
(1,'\'',3.45),
(1,'B',3.99),
(4,'\"',19.23),
(4,'\'\"',10.00)
;

CREATE FUNCTION f_insert_shop ()
RETURNS VARCHAR(20)
DETERMINISTIC
BEGIN
INSERT INTO shop 
VALUES
(5,'\'',3.45),
(5,'B',3.99),
(7,'\"',19.23),
(7,'\'\"',10.00);
RETURN ('STRING');
END
;

It fails because the parser cannot find the ";" delimiter that should split the text into two SQL statements, so it bundles everything into a single statement and execution fails.
I managed to reproduce the issue adding the following line to the unit test file:

        "'a\'b;c\nd'"                                          | ["'a\'b;c\nd'"]

Notice that this is almost the same test that you added, but it uses a single \ instead of a double \. I tried a few options but wasn't able to fix it. Any ideas?
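The kind of failure described above can be reproduced with a small, hypothetical splitter. This is a sketch for illustration only, not Liquibase's actual splitting logic; the class name, method name, and the `backslashEscapes` flag are all assumptions. When the splitter does not treat backslash-quote as an escape, it misjudges where the literal ends and never sees the ";" as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSplitter {
    /**
     * Splits SQL text on ';' outside single-quoted literals.
     * If backslashEscapes is false, a backslash is treated as an ordinary character.
     */
    static List<String> split(String sql, boolean backslashEscapes) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inString = false;
        for (int i = 0; i < sql.length(); i++) {
            char c = sql.charAt(i);
            if (inString && backslashEscapes && c == '\\' && i + 1 < sql.length()) {
                // Keep the escaped character inside the literal without toggling state.
                current.append(c).append(sql.charAt(++i));
                continue;
            }
            if (c == '\'') {
                inString = !inString; // a doubled quote toggles twice, which nets out
            } else if (c == ';' && !inString) {
                statements.add(current.toString().trim());
                current.setLength(0);
                continue;
            }
            current.append(c);
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            statements.add(rest);
        }
        return statements;
    }

    public static void main(String[] args) {
        // The SQL text is: INSERT INTO shop VALUES (1,'\'',3.45); SELECT 1;
        String sql = "INSERT INTO shop VALUES (1,'\\'',3.45); SELECT 1;";
        System.out.println(split(sql, true).size());  // two statements
        System.out.println(split(sql, false).size()); // everything bundled into one
    }
}
```

With `backslashEscapes` set to false, the quote after the backslash closes the literal early, the next quote opens a new one, and the rest of the input (delimiters included) is consumed as string content.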

erasmussen-first · 2023-10-25T14:54:14Z

Thanks. I suspect the \' and \" within the same string are not handled correctly. I'll try to adjust the patterns later today.


sonarcloud · 2023-10-25T21:01:44Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
No Duplication information

filipelautert · 2023-10-30T17:46:03Z

Hi @erasmussen-first! Thanks for the change, but it is still not passing this check. I added the verification to the unit test file so we can validate it using the GitHub build.

erasmussen-first · 2023-10-30T18:00:14Z

That's very strange. It seems to be working here with the pattern that I thought you added in your comment. It may be that GitHub comments turn a double backslash into a single backslash, and I misinterpreted what you said. I'll merge in the latest master to make sure I'm not missing something else and try again with the new pattern added to the test.
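The single-versus-double backslash confusion is easy to check in isolation. The sketch below uses Java string literals, which handle these particular escapes the same way as Groovy double-quoted strings; the class and constant names are hypothetical. In source code, `\'` is only an escape for a quote, so no backslash reaches the parser unless the backslash itself is doubled.

```java
public class BackslashInSource {
    // Written with a single backslash in source: \' collapses to a plain quote.
    static final String SINGLE = "'a\'b;c\nd'";
    // Written with a double backslash in source: one real backslash is in the text.
    static final String DOUBLED = "'a\\'b;c\nd'";

    public static void main(String[] args) {
        System.out.println(SINGLE.length());        // 9 characters, no backslash
        System.out.println(SINGLE.indexOf('\\'));   // -1
        System.out.println(DOUBLED.length());       // 10 characters
        System.out.println(DOUBLED.indexOf('\\'));  // 2
    }
}
```

So a test pattern meant to feed the parser a backslash followed by a quote has to be written with a double backslash in the test source.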


filipelautert

Functional tests passed here: https://github.com/liquibase/liquibase-pro-tests/actions/runs/6825903760 .
@rberezen - this is a change to the heart of the SQL parser, but @erasmussen-first did a great job and it is passing all of our tests.

rberezen · 2023-11-22T20:49:34Z

liquibase-standard/src/main/javacc/liquibase/util/grammar/SimpleSqlGrammar.jj

+|   < #ESC_NON_QUOTE: "\\" ["n","t","b","r","f","\\","0"] >
+
+    /* SQL-standard is that string literals are delimited only by single-quote, and double-quotes are only for identifiers... */
+//|   < #S_QUOTED_STRING_A: ( "'" ( <ESC_S_QUOTE_A> | <ESC_NON_QUOTE> | ~["'"] )* "'") >


@erasmussen-first Hi! I believe we should uncomment these few lines, right?

rberezen · 2023-11-22T20:50:41Z

liquibase-standard/src/main/javacc/liquibase/util/grammar/SimpleSqlGrammar.jj

+//|   < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_A> | <S_QUOTED_STRING_B> | <D_QUOTED_STRING_A> | <D_QUOTED_STRING_B> ) >
+|   < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? (<S_QUOTED_STRING_HYBRID> | <D_QUOTED_STRING_HYBRID>) >
+
+// Previous logic...


@erasmussen-first I do not think we should keep the previous logic. Thanks!

filipelautert · 2023-11-23

@rberezen I think you mean we should remove those commented-out lines?

rberezen · 2023-11-23

@filipelautert yes, thank you ;)

erasmussen-first · 2023-11-27

Hi. I am happy to remove the comments. I had left them in to make it easy to compare the old and new patterns, as well as to suggest ideas for further improvements. I will instead archive the suggestions here in the comments and update the PR.

The patterns below may be of use in future work to determine the end of a string literal less ambiguously, but they require the parser to know which syntax (A or B) applies to the database host. A possible solution is to implement optional parameters and default behavior in the <sql> and <sqlFile> tags. Trying all four patterns at once will produce mistakes, which is why this PR uses hybrid patterns.

    /* SQL-standard is that string literals are delimited only by single-quote, and double-quotes are only for identifiers... */
|   < #S_QUOTED_STRING_A: ( "'" ( <ESC_S_QUOTE_A> | <ESC_NON_QUOTE> | ~["'"] )* "'") >
|   < #S_QUOTED_STRING_B: ( "'" ( <ESC_S_QUOTE_A> | <ESC_S_QUOTE_B> | <ESC_D_QUOTE_B> | <ESC_NON_QUOTE> | ~["\\","'"] )* "'") >
    /* ... but many DBs tolerate double-quotes around string literals, including MySQL (unless you enable ANSI SQL mode), and MSSQL (if you disable SET QUOTED_IDENTIFIER) */
|   < #D_QUOTED_STRING_A: ( "\"" ( <ESC_D_QUOTE_A> | <ESC_NON_QUOTE> | ~["\""] )* "\"") >
|   < #D_QUOTED_STRING_B: ( "\"" ( <ESC_S_QUOTE_B> | <ESC_D_QUOTE_A> | <ESC_D_QUOTE_B> | <ESC_NON_QUOTE> | ~["\\","\""] )* "\"") >
    /* Finally... (pick one based on DB host syntax) */
//|   < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_A> | <D_QUOTED_STRING_A> ) >
//|   < S_CHAR_LITERAL: (["U","E","N","R","B"]|"RB"|"_utf8")? ( <S_QUOTED_STRING_B> | <D_QUOTED_STRING_B> ) >

rberezen · 2023-11-27

@erasmussen-first thank you very much for your help and contribution!

filipelautert · 2023-11-28T12:43:32Z

Thanks @erasmussen-first !

erasmussen-first added 4 commits October 16, 2023 08:57

test to detect SQL parsing issue

6223075

fix error parsing string literals within SQL commands

5caae7e

Merge branch 'master' into improve-endDelimiter

5efc92f

Merge branch 'master' into improve-endDelimiter

a509634

erasmussen-first requested a review from filipelautert as a code owner October 23, 2023 20:32

filipelautert self-assigned this Oct 23, 2023

filipelautert added the TypeBug and SafeToBuild labels Oct 23, 2023

allow escaped double-quote inside single quotes & escaped single-quote inside double quotes (escape isn't needed in these cases, but should be supported)

c8104d9

filipelautert added and removed the SafeToBuild label Oct 25, 2023

Adding extra test

2f06034

filipelautert added and removed the SafeToBuild label Oct 30, 2023

erasmussen-first added 5 commits October 30, 2023 19:30

Merge branch 'master' into improve-endDelimiter

071205a

Merge branch 'improve-endDelimiter' of https://github.com/erasmussen-first/liquibase into improve-endDelimiter

ca70914

use double-backslash in Groovy test pattern for single-backslash in parsed test pattern

83ad743

more improvements to end-of-string detection, add more test patterns

0f75ae6

Merge branch 'master' into improve-endDelimiter

0b6204d

filipelautert added and removed the SafeToBuild label Nov 9, 2023

filipelautert approved these changes Nov 10, 2023


filipelautert requested review from rberezen and suryaaki2 November 10, 2023 19:18

suryaaki2 approved these changes Nov 21, 2023


rberezen reviewed Nov 22, 2023


Remove commented-out code for improved readability

5272b2b

filipelautert requested a review from rberezen November 27, 2023 16:04

filipelautert added and removed the SafeToBuild label Nov 27, 2023

filipelautert added this to the 1NEXT milestone Nov 27, 2023

Merge branch 'master' into improve-endDelimiter

519274b

rberezen added and removed the SafeToBuild label Nov 27, 2023

rberezen approved these changes Nov 28, 2023


filipelautert merged commit dfb6c20 into liquibase:master Nov 28, 2023
38 of 42 checks passed

instagibb mentioned this pull request Jan 17, 2024

Lexical error in 4.25.1 for string containing backslash and a unicode character #5474

Closed

2 tasks

tati-qalified mentioned this pull request Jan 26, 2024

Error "Incorrect syntax near ‘GO’" on SQL Server when having strings terminating with a backslash and beginning with a backslash in one line #3687

Open

This was referenced Mar 8, 2024

BUG: Quoting not always identified properly, causes issues in end delimiter identification #5674

Closed

FIX: SimpleSQLGrammar quote parsing regression #5700

Merged
