Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cql: fix hang during certain SELECT statements #18694

Closed
wants to merge 1 commit into from

Conversation

nyh
Copy link
Contributor

@nyh nyh commented May 15, 2024

The function intersection(r1,r2) in statement_restrictions.cc is used when several WHERE restrictions were applied to the same column. For example, for "WHERE b<1 AND b<2" the intersection of the two ranges is calculated to be b<1.

As noted in issue #18690, Scylla is inconsistent in where it allows or doesn't allow these intersecting restrictions. But where they are allowed they must be implemented correctly. And it turns out the function intersection() had a bug that caused it to sometimes enter an infinite loop - when the intent was only to call itself once with swapped parameters.

This patch includes a test reproducing this bug, and a fix for the bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement the intersection() function to try to make sure that the bug we found was the only one. I also added a few more comments where I thought they were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by Michał Chojnowski.

Fixes #18688
Refs #18690

This patch should be backported to all extant branches: It is a bug that affected an actual user; The hang is a serious problem (in some sense it's even worse than a crash because we can't recover from it, while after a crash we restart Scylla), and this bug can even be used to DoS-attack Scylla by sending a CQL request with a maliciously-crafted set of restrictions.

The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.

As noted in issue scylladb#18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.

This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by
Michał Chojnowski.

Fixes scylladb#18688
Refs scylladb#18690

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
@nyh nyh added backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed backport/6.0 labels May 15, 2024
@nyh nyh requested review from michoecho and ptrsmrn May 15, 2024 18:51
@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests Custom
The following new/updated tests ran 100 times for each mode:
🔹 cql-pytest/test_restrictions
✅ - Container Test
✅ - dtest
✅ - dtest with topology changes
✅ - Unit Tests

Build Details:

  • Duration: 6 hr 10 min
  • Builder: i-0669ff7c6b0920dea (m5ad.8xlarge)

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

@yaronkaikov I see above a failure for "PR Tasks Completed Check / task-check (pull_request_target)". What is this about? Can explain these tests better and/or remove broken tests, to reduce the "boy who cried wolf" problem (where I see the CI tests failing, and I don't worry, because I don't believe it anyway, and next thing you know we start merging code with failing tests).

@yaronkaikov
Copy link
Contributor

The function intersection(r1,r2) in statement_restrictions.cc is used when several WHERE restrictions were applied to the same column. For example, for "WHERE b<1 AND b<2" the intersection of the two ranges is calculated to be b<1.

As noted in issue #18690, Scylla is inconsistent in where it allows or doesn't allow these intersecting restrictions. But where they are allowed they must be implemented correctly. And it turns out the function intersection() had a bug that caused it to sometimes enter an infinite loop - when the intent was only to call itself once with swapped parameters.

This patch includes a test reproducing this bug, and a fix for the bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement the intersection() function to try to make sure that the bug we found was the only one. I also added a few more comments where I thought they were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by Michał Chojnowski.

Fixes #18688 Refs #18690

  • ** Backport reason (please explain below if this patch should be backported or not) **

The failure is about this checkbox, you show check it and explain why this patch requires backport

@mykaul
Copy link
Contributor

mykaul commented May 16, 2024

The function intersection(r1,r2) in statement_restrictions.cc is used when several WHERE restrictions were applied to the same column. For example, for "WHERE b<1 AND b<2" the intersection of the two ranges is calculated to be b<1.
As noted in issue #18690, Scylla is inconsistent in where it allows or doesn't allow these intersecting restrictions. But where they are allowed they must be implemented correctly. And it turns out the function intersection() had a bug that caused it to sometimes enter an infinite loop - when the intent was only to call itself once with swapped parameters.
This patch includes a test reproducing this bug, and a fix for the bug. The test hangs before the fix, and passes after the fix.
While at it, I carefully reviewed the entire code used to implement the intersection() function to try to make sure that the bug we found was the only one. I also added a few more comments where I thought they were needed to understand complicated logic of the code.
The bug, the fix and the test were originally discovered by Michał Chojnowski.
Fixes #18688 Refs #18690

  • ** Backport reason (please explain below if this patch should be backported or not) **

The failure is about this checkbox, you show check it and explain why this patch requires backport

We need to add wording such as 'Mandatory' around that checkbox.

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

The failure is about this checkbox, you show check it and explain why this patch requires backport

How should I have guessed this?
Even worse, how should I guess what I need to do with this checkbox? I honestly don't understand... Do I need to "check" it (always? just if I want a backport)? Do I need to replace the garbage text by something or write some text following (this is what the text says...) after it?

Moreover, this is a PR labeled bug (I guess some bot copied it from the issue), so why do I need more justification why it should be backported? Don't we want to backport all bug-fixes unless it's too hard (and to know that, I would need to start the backport - and this isn't the time for it).

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

We need to add wording such as 'Mandatory' around that checkbox.

In my opinion we need to get rid of this checkbox and the bot that uses it. Nobody understand what it means (at least I don't, see my questions above) and I don't believe it helps anyone - not reviewers and not developers.

@yaronkaikov
Copy link
Contributor

The failure is about this checkbox, you show check it and explain why this patch requires backport

How should I have guessed this? Even worse, how should I guess what I need to do with this checkbox? I honestly don't understand... Do I need to "check" it (always? just if I want a backport)? Do I need to replace the garbage text by something or write some text following (this is what the text says...) after it?

I will make it more clear, will send a fix shortly

Moreover, this is a PR labeled bug (I guess some bot copied it from the issue), so why do I need more justification why it should be backported? Don't we want to backport all bug-fixes unless it's too hard (and to know that, I would need to start the backport - and this isn't the time for it).

Because this is what was requested by @avikivity , labeling a PR with a relevant backport label doesn't explain why should we backport or not.

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

Because this is what was requested by @avikivity , labeling a PR with a relevant backport label doesn't explain why should we backport or not.

Maybe @avikivity can explain what he requested, but I have absolutely no idea what this "checkbox" expected me to do, and still don't. A paragraph about backports is now necessary in every PR? Where should this paragraph be written - below the checkbox? Instead of it? Do I need to "check" that checkbox? Do I need to delete it so that garbage won't be saved to the git history? I can't figure this out.

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

Maybe @avikivity can explain what he requested, but I have absolutely no idea what this "checkbox" expected me to do

To make a concrete request: If this checkbox thing exists because @avikivity requested it, then please replace the text by a long and self-explanatory text explaining exactly what the developer needs to do and where - remove the explanation or let it be? click the checkbox? write new text where? etc. - and what will happen with it. The current explanation text is not self-explanatory at all and I didn't understand what it asked me to do.

@avikivity
Copy link
Member

Because this is what was requested by @avikivity , labeling a PR with a relevant backport label doesn't explain why should we backport or not.

Maybe @avikivity can explain what he requested, but I have absolutely no idea what this "checkbox" expected me to do, and still don't. A paragraph about backports is now necessary in every PR? Where should this paragraph be written - below the checkbox? Instead of it? Do I need to "check" that checkbox? Do I need to delete it so that garbage won't be saved to the git history? I can't figure this out.

I didn't ask for a checkbox (and I failed to understand it in exactly the same way @nyh) did.

I asked for a tag, and for the tag to carry over to the git changelog.

@yaronkaikov
Copy link
Contributor

Because this is what was requested by @avikivity , labeling a PR with a relevant backport label doesn't explain why should we backport or not.

Maybe @avikivity can explain what he requested, but I have absolutely no idea what this "checkbox" expected me to do, and still don't. A paragraph about backports is now necessary in every PR? Where should this paragraph be written - below the checkbox? Instead of it? Do I need to "check" that checkbox? Do I need to delete it so that garbage won't be saved to the git history? I can't figure this out.

I didn't ask for a checkbox (and I failed to understand it in exactly the same way @nyh) did.

I asked for a tag, and for the tag to carry over to the git changelog.

The checkbox is just for a reminder in case someone forgot to explain. we can drop it, and make sure the maintainers will follow this and make sure if the explanation is good enough for each PR

@avikivity
Copy link
Member

Because this is what was requested by @avikivity , labeling a PR with a relevant backport label doesn't explain why should we backport or not.

Maybe @avikivity can explain what he requested, but I have absolutely no idea what this "checkbox" expected me to do, and still don't. A paragraph about backports is now necessary in every PR? Where should this paragraph be written - below the checkbox? Instead of it? Do I need to "check" that checkbox? Do I need to delete it so that garbage won't be saved to the git history? I can't figure this out.

I didn't ask for a checkbox (and I failed to understand it in exactly the same way @nyh) did.
I asked for a tag, and for the tag to carry over to the git changelog.

The checkbox is just for a reminder in case someone forgot to explain. we can drop it, and make sure the maintainers will follow this and make sure if the explanation is good enough for each PR

I want machine enforcement. But here it's just enforcing that people check the checkbox. It's pure paperwork.

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

The checkbox is just for a reminder in case someone forgot to explain.

Maybe I'm being dense, but I still don't understand.
Let's say I want to explain. How do I do that? Should I remove the "reminder" and just write text? Do I write text after the reminder and leave the reminder untouched? Or maybe I am supposed to "check" the checkbox and not just leave it untouched?

What does the bot (which failed in my PR) actually check? Does it expect me to have checked the checkbox, expect me to have removed it, or what?

@avikivity
Copy link
Member

@nyh the checkbox is there to remind you to check the checkbox

@yaronkaikov
Copy link
Contributor

@nyh the checkbox is there to remind you to check the checkbox

The reason for the checkbox is to fail one of the PR checks so people will be aware something is missing

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

The reason for the checkbox is to fail one of the PR checks so people will be aware something is missing

Yes, I was aware "something is missing". But I had no idea what. I didn't even know the problem was related to that checkbox until you told me (I didn't know what "task-check" even meant). I still don't know what exactly is missing :-) Is what missing that I need to keep the silly checkbox line and click it? Or am I allowed to delete the checkbox line (instead of clicking on it) so it won't appear in the git log? I'll try now and see what happens.

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

I'm happy to report that by deleting the checkbox line (and I also added a paragraph mentioning backports), the bot was appeased. Apparently you are not expected to check the box - you are expected to delete it.

@michoecho
Copy link
Contributor

What does the bot (which failed in my PR) actually check?

@nyh It's not a secret.

You can find a link to the bot ("action") here (the "Details" link):

image

The link takes you here:

https://github.com/scylladb/scylladb/actions/runs/9101152077/job/25017639556

You can find the source of the bot on the left, in the "Workload file":

https://github.com/scylladb/scylladb/actions/runs/9101152077/workflow

There you can see that it calls chromaui/task-completed-checker-action@main, which is a just a script that checks the PR text for unchecked checkboxes. Specifically, it calls this script:

https://github.com/chromaui/task-completed-checker-action/blob/main/src/main.ts

And there you can find the details:

const isTaskCompleted = cleanText.match(/(- \[[ ]\].+)/g) === null;

So, to sum up, your cover letter is not allowed to have any matches with the regex - \[[ ]\].+.

@nyh
Copy link
Contributor Author

nyh commented May 16, 2024

What does the bot (which failed in my PR) actually check?

@nyh It's not a secret.

You can find a link to the bot ("action") here (the "Details" link):

image

The link takes you here:

https://github.com/scylladb/scylladb/actions/runs/9101152077/job/25017639556

You can find the source of the bot on the left, in the "Workload file":

https://github.com/scylladb/scylladb/actions/runs/9101152077/workflow

There you can see that it calls chromaui/task-completed-checker-action@main, which is a just a script that checks the PR text for unchecked checkboxes. Specifically, it calls this script:

https://github.com/chromaui/task-completed-checker-action/blob/main/src/main.ts

And there you can find the details:

const isTaskCompleted = cleanText.match(/(- \[[ ]\].+)/g) === null;

So, to sum up, your cover letter is not allowed to have any matches with the regex - \[[ ]\].+.

Thanks. This is more-or-less what I discovered by experimentation - that just removing that silly checkbox (instead of checking it) works.

Why are we using some random project's (chromeui) 20-line script in our project? One day they can change their script or remove it, and cause us a big mess :-(

@yaronkaikov
Copy link
Contributor

What does the bot (which failed in my PR) actually check?

@nyh It's not a secret.
You can find a link to the bot ("action") here (the "Details" link):
image
The link takes you here:
https://github.com/scylladb/scylladb/actions/runs/9101152077/job/25017639556
You can find the source of the bot on the left, in the "Workload file":
https://github.com/scylladb/scylladb/actions/runs/9101152077/workflow
There you can see that it calls chromaui/task-completed-checker-action@main, which is a just a script that checks the PR text for unchecked checkboxes. Specifically, it calls this script:
https://github.com/chromaui/task-completed-checker-action/blob/main/src/main.ts
And there you can find the details:

const isTaskCompleted = cleanText.match(/(- \[[ ]\].+)/g) === null;

So, to sum up, your cover letter is not allowed to have any matches with the regex - \[[ ]\].+.

Thanks. This is more-or-less what I discovered by experimentation - that just removing that silly checkbox (instead of checking it) works.

Why are we using some random project's (chromeui) 20-line script in our project? One day they can change their script or remove it, and cause us a big mess :-(

Sending a patch to remove it, the check and checkbox. will leave the PR template, hoping everyone will remember to add/explain the reason for backporting or not

@mykaul
Copy link
Contributor

mykaul commented May 16, 2024

Sending a patch to remove it, the check and checkbox. will leave the PR template, hoping everyone will remember to add/explain the reason for backporting or not

buried in a random comment in an endless thread of comments.

@yaronkaikov
Copy link
Contributor

Sending a patch to remove it, the check and checkbox. will leave the PR template, hoping everyone will remember to add/explain the reason for backporting or not

buried in a random comment in an endless thread of comments.

#18708

@mykaul
Copy link
Contributor

mykaul commented May 26, 2024

This was merged to master before the branch to 6.0, so no need to backport to 6.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cql backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed bug promoted-to-master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clustering_range intersection() can cause an infinite loop
7 participants