Better failure reporting from database queue #119

jonyoder · 2023-08-01T19:10:58Z

I'm curious for your thoughts on #119 and #118.

We recently encountered heavy load from EXCLUSIVE table locks on PPM. We have a complicated workflow for completing a queue job. It boils down to something like this:

When the queue agent encounters the end of a job, and the job is for addressed work (typically, work for the cache), it:

1. Calls a method to finalize/complete the work, which:
a. Grabs an exclusive lock.
b. Clears any existing failure records for that address in the queue_failure table.
c. If the job ended in error, inserts a new record into queue_failure.
2. Sends a notification to all nodes indicating that work for the address is done.
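The completion steps above can be sketched roughly as follows. This is an in-memory stand-in, not the actual implementation: `failureStore`, `completeWork`, `jobError`, and the `notify` callback are hypothetical names, a `sync.Mutex` plays the role of the exclusive table lock, and a map plays the role of the queue_failure table.

```go
package main

import "sync"

// jobError is a hypothetical error type used only for this illustration.
type jobError string

func (e jobError) Error() string { return string(e) }

// failureStore is an in-memory stand-in for the queue_failure table.
type failureStore struct {
	mu       sync.Mutex        // stands in for the exclusive table lock
	failures map[string]string // address -> recorded error message
}

func newFailureStore() *failureStore {
	return &failureStore{failures: make(map[string]string)}
}

// completeWork mirrors the described order of operations: take the lock,
// clear old failure records for the address, record a new failure if the
// job ended in error, release the lock, then notify all nodes.
func (s *failureStore) completeWork(address string, jobErr error, notify func(address string)) {
	s.mu.Lock()
	delete(s.failures, address)
	if jobErr != nil {
		s.failures[address] = jobErr.Error()
	}
	s.mu.Unlock()

	// The notification goes out only after the failure record is settled.
	notify(address)
}

// failureFor reports the recorded failure for an address, if any.
func (s *failureStore) failureFor(address string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	msg, ok := s.failures[address]
	return msg, ok
}
```

Every completion takes the lock in this sketch, even for successful jobs, which mirrors the lock contention described above.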

Meanwhile, the node that requested the asset from the cache is running a polling loop. This loop:

1. At the start, preemptively checks the cache to see if the asset is already here. If so, it simply returns.
2. Waits for one of the following:
a. A notification saying that work with address is complete. When that's received, it polls for the asset again.
b. A notification saying that chunks of the work for address are completed. When that's received, it doesn't poll again, but simply returns so we can start serving the chunks.
c. A time.Ticker to fire. We added this just in case we never get the completion notification. In this case, we poll again.
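As a rough illustration of that loop, here is a minimal sketch using `select` over two notification channels and a `time.Ticker`. The names `waitForAsset`, `errNotFound`, `check`, `done`, and `chunks` are hypothetical, and the channels are stand-ins for whatever the notification mechanism actually delivers. The sketch also assumes the loop keeps waiting if a poll still finds nothing.

```go
package main

import (
	"errors"
	"time"
)

// errNotFound is a hypothetical sentinel meaning "not in the cache yet".
var errNotFound = errors.New("asset not found in cache")

// waitForAsset sketches the polling loop. check is a hypothetical cache
// lookup returning the asset, errNotFound, or a recorded failure.
// It returns (asset, false, nil) once the asset is available, or
// (nil, true, nil) when chunks are ready to be served.
func waitForAsset(check func() ([]byte, error), done, chunks <-chan struct{}, interval time.Duration) ([]byte, bool, error) {
	// Preemptive check: the asset may already be in the cache.
	if asset, err := check(); !errors.Is(err, errNotFound) {
		return asset, false, err
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-chunks:
			// Chunks are available: return without polling again.
			return nil, true, nil
		case <-done:
			// Completion notification: fall through and poll again.
		case <-ticker.C:
			// Safety net in case the notification never arrives.
		}
		if asset, err := check(); !errors.Is(err, errNotFound) {
			return asset, false, err
		}
	}
}
```

A production version would presumably also need a cancellation path (for example a `context.Context`), which is left out here to stay close to the description.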

In the above polling loop, each time we check the cache for the asset, we:

1. Check the queue_failure table to see if a failure is recorded for the work. If so, return the error.
2. If no error, return the asset.
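A minimal sketch of that check, with maps standing in for the queue_failure table and the cache. `assetCache`, `get`, and `errAssetNotFound` are hypothetical names, and the not-found case is an assumption added so the function is total.

```go
package main

import "errors"

// errAssetNotFound is a hypothetical sentinel for "no failure, no asset yet".
var errAssetNotFound = errors.New("asset not found in cache")

// assetCache is an in-memory stand-in: failures models the queue_failure
// table and assets models the cache contents, both keyed by address.
type assetCache struct {
	failures map[string]string
	assets   map[string][]byte
}

// get checks for a recorded failure first, and only then looks up the asset.
func (c *assetCache) get(address string) ([]byte, error) {
	if msg, failed := c.failures[address]; failed {
		return nil, errors.New(msg)
	}
	asset, ok := c.assets[address]
	if !ok {
		return nil, errAssetNotFound
	}
	return asset, nil
}
```

In the real system the failure check is a database query, so every poll pays for a round trip to the queue_failure table.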

I feel like our notification mechanism is robust and proven enough that we may be able to eliminate all of this complication, including the queue_failure table. The PRs above update the "work complete" notification to also include the error if the work ended in error. If a node misses the notification, it'll still periodically poll and pick up the asset if it gets created without an error (same as before). The only difference will be if the work completes in error, but the notification isn't received. In that case, an actor that's waiting on the asset may poll and find that the work is completed, but the asset won't be found. We already handle this scenario with an error that reports something like "the queue reported that x work is complete, but the item was not found in the cache".
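To make the proposal concrete, here is a small sketch of a completion notification that carries the error, and of how a waiting node might resolve it. The `workComplete` type, its fields, and `resolve` are hypothetical and are not taken from the PRs. The fallback error text follows the wording quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// workComplete is a hypothetical "work complete" notification payload.
// Err is empty when the work finished successfully.
type workComplete struct {
	Address string
	Err     string
}

// resolve handles a received notification without consulting any failure
// table: the error, if there was one, arrives with the notification itself.
// lookup is a hypothetical cache lookup.
func resolve(n workComplete, lookup func(address string) ([]byte, bool)) ([]byte, error) {
	if n.Err != "" {
		return nil, errors.New(n.Err)
	}
	asset, ok := lookup(n.Address)
	if !ok {
		// Work was reported complete, but nothing is in the cache.
		return nil, fmt.Errorf("the queue reported that work for %q is complete, but the item was not found in the cache", n.Address)
	}
	return asset, nil
}
```

With this shape, the only case that loses the original error is a missed notification for failed work, which falls back to the generic "not found" error.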

Better failure reporting from database queue

1df4a9f

jonyoder force-pushed the jon-queue-failure-db branch from e9d7991 to 1df4a9f Compare August 1, 2023 19:28

jonyoder mentioned this pull request Aug 1, 2023

Better failure reporting from queue #118

Open
