Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: table-not-found issue with executeSelect while running long queries #2222

Merged
merged 14 commits into from Aug 23, 2022

Conversation

prash-mi
Copy link
Contributor

@prash-mi prash-mi commented Aug 12, 2022

Internal Bug's Ref: b/241134681 .

table-not-found issue has been observed with executeSelect while running long queries. This issue comes while initialising storage read session when the query job is not complete.

This is a short term fix where we are polling the job's status using jobs.getQueryResults and the session with read API is initialised when the job is complete, thus avoiding table-not-found.

Capturing this FR for the long term fix/re-design: #2240

Ref: go/executeselect-re-design

@prash-mi prash-mi requested review from a team and loferris August 12, 2022 09:47
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/java-bigquery API. labels Aug 12, 2022
@prash-mi prash-mi added the owlbot:run Add this label to trigger the Owlbot post processor. label Aug 12, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Aug 12, 2022
@prash-mi prash-mi requested a review from a team as a code owner August 12, 2022 10:56
@prash-mi prash-mi added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 19, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Aug 19, 2022
@prash-mi prash-mi added the owlbot:run Add this label to trigger the Owlbot post processor. label Aug 19, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Aug 19, 2022
// for getQueryResults) per iteration of the loop
long startTimeMs = System.currentTimeMillis();
long totalTimeOutMs = 18 * 60 * 60 * 1000; // 18 hours, which is the max timeout for the job
long poolingIntervalMs = 60000; // 1 min
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/pooling/polling/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for point it out, I have update it

// This logic will wait for approx (poolingIntervalMs + 10 seconds which is the default timeout
// for getQueryResults) per iteration of the loop
long startTimeMs = System.currentTimeMillis();
long totalTimeOutMs = 18 * 60 * 60 * 1000; // 18 hours, which is the max timeout for the job
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to leave the totalTimeoutMs and assorted logic out of this PR. You may in the future want to add a user configurable timeout, but per-rpc timeouts are sufficient.

BQ guarantees that jobs make forward progress (a job won't get stuck in pending forever). This interval is so long you may as well just trust the system.

Copy link
Contributor Author

@prash-mi prash-mi Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. I have updated the logic and have added a timeoutMs param at the RPC layer which is currently hardcoded to 60 seconds. QQ, in the very corner case when the job runs for 18 hours and is still not-complete then will the backend throw and error? Otherwise it will be an infinite loop as jobComplete will never be true.
Also, I have captured task to make timeoutMs user configurable here: #2240 (point#6)


} else { // wait for the defined poolingIntervalMs and the loop will retry
try {
Thread.sleep(poolingIntervalMs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior here seems wrong. getQueryResults will poll from the server side for up to 10 seconds, and then you sleep for 60 in the client? Effective latency for an 11 sec query becomes more than a minute.

You should not wait on the client side at all, just re-issue a subsequent getQueryResults call. Allow the poll and wait to happen within BQ.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, updated to logic to use RCP based poll and wait

@prash-mi prash-mi added the owlbot:run Add this label to trigger the Owlbot post processor. label Aug 22, 2022
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Aug 22, 2022
Copy link
Contributor

@shollyman shollyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting all this together!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/java-bigquery API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants