Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProcessPayload() has a race condition due to setting failed on any exceptions #189

Open
jkeifer opened this issue Nov 2, 2022 · 0 comments

Comments

@jkeifer
Copy link
Collaborator

jkeifer commented Nov 2, 2022

If a step function is started successfully it seems we should never set the state to FAILED, which could happen if setting the state to PROCESSING fails with an exception. However, this could lead to a state DB entry that succeeds or fails with no execution ARN set.

If we leave things as they are, then it is possible we end up processing a payload twice at the same time, because the FAILED state will not gate a re-execution (and the SQS redrive will mean we attempt processing again for the same payload).

One possible solution to this issue is to set the execution ARN with update-state for the terminal state updates at the end of an execution. However, the execution ARNs are stored in an array on the state record, so it is not trivial to ensure that array is updated correctly without pulling that field down and iterating through it to check for the presence of the current ARN.

If we were to not set PROCESSING in process (related: #188), we could use update-state to set PROCESSING for step function start events. A critical problem here though is that event processing is not guaranteed to happen in order; we could enforce state change ordering to ensure we end up with the correct state (i.e., we don't move state backwards), except this potentially breaks down or has its own race conditions when we consider reprocessing (how can we know if the event is for the most recent execution and should be the one to take precedence?). In practice this might not be a real issue, as if events are so drastically delayed it is also likely that AWS services are impacted enough to cause other major problems.

One workaround that would mitigate this problem, however, would be to switch the state DB schema to per-execution state entries, rather than a single state entry per item. Queries become potentially complicated by this however, and this model might only work well in an RDBMS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant