`ProcessPayload()` has a race condition due to setting failed on any exceptions #189

jkeifer · 2022-11-02T22:29:03Z

If a step function is started successfully it seems we should never set the state to FAILED, which could happen if setting the state to PROCESSING fails with an exception. However, this could lead to a state DB entry that succeeds or fails with no execution ARN set.

If we leave things as they are, then it is possible we end up processing a payload twice at the same time, because the FAILED state will not gate a re-execution (and the SQS redrive will mean we attempt processing again for the same payload).

One possible solution to this issue is to set the execution ARN with update-state for the terminal state updates at the end of an execution. However, the execution ARNs are stored in an array on the state record, so it is not trivial to ensure that array is updated correctly without pulling that field down and iterating through it to check for the presence of the current ARN.

If we were to not set PROCESSING in process (related: #188), we could use update-state to set PROCESSING for step function start events. A critical problem here though is that event processing is not guaranteed to happen in order; we could enforce state change ordering to ensure we end up with the correct state (i.e., we don't move state backwards), except this potentially breaks down or has its own race conditions when we consider reprocessing (how can we know if the event is for the most recent execution and should be the one to take precedence?). In practice this might not be a real issue, as if events are so drastically delayed it is also likely that AWS services are impacted enough to cause other major problems.

One workaround that would mitigate this problem, however, would be to switch the state DB schema to per-execution state entries, rather than a single state entry per item. Queries become potentially complicated by this however, and this model might only work well in an RDBMS.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ProcessPayload()` has a race condition due to setting failed on any exceptions #189

`ProcessPayload()` has a race condition due to setting failed on any exceptions #189

jkeifer commented Nov 2, 2022

ProcessPayload() has a race condition due to setting failed on any exceptions #189

ProcessPayload() has a race condition due to setting failed on any exceptions #189

Comments

jkeifer commented Nov 2, 2022

`ProcessPayload()` has a race condition due to setting failed on any exceptions #189

`ProcessPayload()` has a race condition due to setting failed on any exceptions #189