
Workaround failing deadlock when running backfill #32991

Merged
merged 1 commit into apache:main from chase-flaky-quarantine-test on Aug 9, 2023

Conversation

potiuk
Member

@potiuk potiuk commented Aug 1, 2023

The dask_executor backfill tests have recently started to fail more often with a backfill exception. The likely cause is that execution is now better parallelised, which makes it much easier to trigger deadlocks caused by contention between dag_run state updates and task state updates.

While this PR does not fix the underlying issue, it catches the operational error raised when the deadlock occurs during the backfill and rolls back the operation.

This should be safe: backfill has a built-in mechanism to loop and retry failed tasks, and the test passed multiple times, completing the backfill after this fix was applied. The failure was not easy to reproduce locally, but it occurred roughly once every 20-30 runs. With extra logging added, it was always connected to an OperationalError raised (and caught) right after _per_task_process. The same exception was observed a few times after the rollback was added, and despite it the backfill job retried and completed the process successfully every time. We also keep the exception logs and add reassuring messages making it clear that, if the backfill completes, the exceptions can be ignored because the updates will be retried by the backfill job.

Fixes: #32778



@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 5 times, most recently from f1ac81f to f7e9f89 Compare August 3, 2023 16:33
@potiuk potiuk changed the title Attempt to get logs from failed flaky bakfill test Attempt to get logs from failed flaky backfill test Aug 3, 2023
@potiuk potiuk added the full tests needed We need to run full set of tests for this PR to merge label Aug 3, 2023
@potiuk potiuk closed this Aug 3, 2023
@potiuk potiuk reopened this Aug 3, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 4 times, most recently from 7ddd389 to c9c37e1 Compare August 6, 2023 00:41
@potiuk potiuk changed the title Attempt to get logs from failed flaky backfill test Workaround failing dedlock when running backfill Aug 6, 2023
@potiuk potiuk marked this pull request as ready for review August 6, 2023 00:43
@potiuk
Member Author

potiuk commented Aug 6, 2023

Hey everyone - I am not very proud of this PR, but I am cautiously optimistic that it works around the cursed deadlock that started to show up in test_backfill_job of the DaskExecutor.

I believe it was a real issue, not a test problem, and I think the workaround I came up with should also handle similar "production" issues when they happen.

It is similar to other fixes for the backfill job causing deadlocks (you can see comparable workarounds in other places).

Generally speaking - I was able to reproduce it locally (it happens once every 30-50 runs or so), so it is definitely not a "side-effect" of another test. I ran this one individually with the --with-db-init flag and still (with difficulty) was able to produce the deadlock. When I added some debugging, it turned out that the deadlock in "commit()" in the outer loop was actually caused by a deadlock exception raised in the inner loop - right after _per_task_process. The exception was propagated up, and then the finally clause around line 960 attempted to commit again:

        finally:
            session.commit()
            executor.end()

So we never saw the original place where the deadlock occurred - because we did not log that OperationalError in the inner loop, and the deadlock was reported again when the outer process (not even the loop) tried to run the commit for the already deadlocked session - at the end of the backfill job.

It looks like the deadlock is caused by the lack of "for update" on the DagRun in the backfill job, while the backfill job does take "for update" on the task instance. So if the scheduler happens to hold the dag_run row "for update" and wants to update the task_instance while the backfill job updates them in the opposite order, the two transactions deadlock.
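
For illustration only - this is not code from this PR - taking a row-level lock on the DagRun before touching its task instances would serialize the two writers. A minimal SQLAlchemy sketch (dag_run_id is assumed to be the id of an existing run):

    from airflow.models.dagrun import DagRun
    from airflow.utils.session import create_session

    # Illustrative sketch, not the PR's change: locking the DagRun row up front
    # means concurrent dag_run / task_instance updates from another session
    # cannot interleave with ours and deadlock.
    with create_session() as session:
        dag_run = (
            session.query(DagRun)
            .filter(DagRun.id == dag_run_id)  # dag_run_id: assumed to exist
            .with_for_update()
            .one()
        )
        # ... update the task instances that belong to this run here ...
        # the lock is released when create_session() commits on exit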

BTW, possibly this is an inherent problem with the backfill job. Maybe this is really expected, and maybe the workaround is the best we can do?

I gathered server-side logging where I could see both sides of the deadlock:

	Process 11417 waits for ShareLock on transaction 70151; blocked by process 11337.

        # !!!! This is what backfill job does !!!

	Process 11337: UPDATE dag_run SET last_scheduling_decision=NULL, updated_at='2023-08-05T23:07:39.040007+00:00'::timestamptz WHERE dag_run.id = 1

        # !!!!! This is what I presume Scheduler does !!!! 

	Process 11417: UPDATE task_instance SET start_date='2023-08-05T23:07:42.249709+00:00'::timestamptz, end_date='2023-08-05T23:07:42.249709+00:00'::timestamptz, duration=0.0, state='skipped', updated_at='2023-08-05T23:07:42.253235+00:00'::timestamptz WHERE task_instance.dag_id = 'example_bash_operator' AND task_instance.task_id = 'run_this_last' AND task_instance.run_id = 'backfill__2017-01-01T00:00:00+00:00' AND task_instance.map_index =  -1

I have a few more examples of server logs from local reproductions: https://gist.github.com/potiuk/6bd2d855a01ed3e8877a30cefcec01b3

I added a bit more logging, and I specifically call rollback() in the inner loop - without re-raising the error. This should not be a problem because the backfill job runs in a loop and will pick up the task again next time. So the worst that can happen is that the task that caused the deadlock gets slightly delayed.
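
In shape, the inner-loop handling amounts to something like the following - a minimal sketch of the pattern rather than the literal diff, assuming it sits inside the backfill task loop where self.log, _per_task_process, key, ti and session already exist:

    from sqlalchemy.exc import OperationalError

    try:
        _per_task_process(key, ti, session)
        session.commit()
    except OperationalError:
        # A DB deadlock surfaces as an OperationalError: log it, roll back, and
        # let the backfill loop pick the task up again on its next iteration
        # instead of failing the whole backfill.
        self.log.exception(
            "Failed to commit task state due to an operational error; rolling back. "
            "The backfill will retry, so if it completes this exception can be ignored."
        )
        session.rollback()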

It was far more reproducible on CI (often 1 or 2 Postgres jobs failed with it), so I will now close/reopen this PR a few times tomorrow to see whether the deadlock still happens in CI with this fix.

I'd love to hear more from those who are better experts on this than I am. Maybe collectively we can solve the puzzle better than just with a workaround.

@potiuk potiuk added this to the Airflow 2.7.0 milestone Aug 6, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from bb0f94a to e0b1c35 Compare August 6, 2023 12:04
@potiuk
Member Author

potiuk commented Aug 6, 2023

So far I have run it ~7 times with "full tests needed", and not a single backfill_job deadlock has occurred.

@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from e0b1c35 to aa8eab7 Compare August 6, 2023 12:46
@potiuk potiuk changed the title Workaround failing dedlock when running backfill Workaround failing deadlock when running backfill Aug 6, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from aa8eab7 to f63131f Compare August 6, 2023 13:51
@potiuk
Member Author

potiuk commented Aug 6, 2023

Happened once: https://github.com/apache/airflow/actions/runs/5776982503/job/15656723055?pr=32991

3 repeats might not be enough?

@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from f63131f to a32376f Compare August 6, 2023 14:31
@potiuk potiuk closed this Aug 6, 2023
@potiuk potiuk reopened this Aug 6, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 3 times, most recently from 8f8803b to e9d85c7 Compare August 8, 2023 08:33
@potiuk
Member Author

potiuk commented Aug 8, 2023

How about we merge this one?

  • looks stable (increased max_attempts to 5)
  • unquarantines the failing backfill job
  • gives a better user experience than "failing with deadlock" when the deadlock is apparently recoverable

@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 2 times, most recently from a31a13c to de5c147 Compare August 8, 2023 23:55
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from de5c147 to 048c9b9 Compare August 8, 2023 23:56
@uranusjr
Member

uranusjr commented Aug 9, 2023

Any particular reason this doesn’t use @retry_db_transaction?

@potiuk
Member Author

potiuk commented Aug 9, 2023

Any particular reason this doesn’t use @retry_db_transaction?

The only reason is that I added those retries incrementally, trying different things. But yeah, good idea - I can actually extract the inner code into a method and decorate it; that would be way cleaner. Let me do it.
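
For reference, the refactor discussed here would look roughly like the following - a hypothetical sketch, not what was merged; the helper name and body are made up, only the @retry_db_transaction decorator is the existing utility from airflow.utils.retries:

    from airflow.utils.retries import retry_db_transaction

    # Hypothetical sketch of the suggested refactor (not the merged change):
    # extract the deadlock-prone block into a helper and let the decorator
    # retry it on OperationalError, rolling the session back between attempts.
    @retry_db_transaction
    def _commit_task_updates(session):
        ...  # the per-task state updates that occasionally deadlock
        session.commit()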

@potiuk
Member Author

potiuk commented Aug 9, 2023

Any particular reason this doesn’t use @retry_db_transaction?

The only reason is that I added those retries incrementally, trying different things. But yeah, good idea - I can actually extract the inner code into a method and decorate it; that would be way cleaner. Let me do it.

I looked at it - it would make things awfully inconsistent. The backfill job is in VERY BAD need of refactoring; it already uses the try/except/loop pattern, and introducing the decorator would make it look even worse than what we have now (we would have to add yet another internal method inside a 5-level-indented, 800-line method).

That would be a bit risky to do now, as we want to cherry-pick this one to 2.7 - and it would be difficult to fix any problems it might cause if we make a mistake of some sort.

I think I will leave refactoring the backfill job as a "post-2.7.0" task. I created an issue for it: #33249 - maybe we can engage a few people to collaborate on that. I personally think the backfill job is one of the most neglected parts of Airflow, and maybe for 2.8 we could get it back in order, easier to understand and reason about.

@potiuk potiuk merged commit f616ee8 into apache:main Aug 9, 2023
61 checks passed
@potiuk potiuk deleted the chase-flaky-quarantine-test branch August 9, 2023 11:00
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Aug 9, 2023
ephraimbuddy pushed a commit that referenced this pull request Aug 9, 2023
(cherry picked from commit f616ee8)
Successfully merging this pull request may close these issues.

Flaky dask backfill test in quarantine