You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If a step function is started successfully it seems we should never set the state to FAILED, which could happen if setting the state to PROCESSING fails with an exception. However, this could lead to a state DB entry that succeeds or fails with no execution ARN set.
If we leave things as they are, then it is possible we end up processing a payload twice at the same time, because the FAILED state will not gate a re-execution (and the SQS redrive will mean we attempt processing again for the same payload).
One possible solution to this issue is to set the execution ARN with update-state for the terminal state updates at the end of an execution. However, the execution ARNs are stored in an array on the state record, so it is not trivial to ensure that array is updated correctly without pulling that field down and iterating through it to check for the presence of the current ARN.
If we were to not set PROCESSING in process (related: #188), we could use update-state to set PROCESSING for step function start events. A critical problem here though is that event processing is not guaranteed to happen in order; we could enforce state change ordering to ensure we end up with the correct state (i.e., we don't move state backwards), except this potentially breaks down or has its own race conditions when we consider reprocessing (how can we know if the event is for the most recent execution and should be the one to take precedence?). In practice this might not be a real issue, as if events are so drastically delayed it is also likely that AWS services are impacted enough to cause other major problems.
One workaround that would mitigate this problem, however, would be to switch the state DB schema to per-execution state entries, rather than a single state entry per item. Queries become potentially complicated by this however, and this model might only work well in an RDBMS.
The text was updated successfully, but these errors were encountered:
If a step function is started successfully it seems we should never set the state to
FAILED
, which could happen if setting the state toPROCESSING
fails with an exception. However, this could lead to a state DB entry that succeeds or fails with no execution ARN set.If we leave things as they are, then it is possible we end up processing a payload twice at the same time, because the
FAILED
state will not gate a re-execution (and the SQS redrive will mean we attempt processing again for the same payload).One possible solution to this issue is to set the execution ARN with
update-state
for the terminal state updates at the end of an execution. However, the execution ARNs are stored in an array on the state record, so it is not trivial to ensure that array is updated correctly without pulling that field down and iterating through it to check for the presence of the current ARN.If we were to not set
PROCESSING
inprocess
(related: #188), we could useupdate-state
to setPROCESSING
for step function start events. A critical problem here though is that event processing is not guaranteed to happen in order; we could enforce state change ordering to ensure we end up with the correct state (i.e., we don't move state backwards), except this potentially breaks down or has its own race conditions when we consider reprocessing (how can we know if the event is for the most recent execution and should be the one to take precedence?). In practice this might not be a real issue, as if events are so drastically delayed it is also likely that AWS services are impacted enough to cause other major problems.One workaround that would mitigate this problem, however, would be to switch the state DB schema to per-execution state entries, rather than a single state entry per item. Queries become potentially complicated by this however, and this model might only work well in an RDBMS.
The text was updated successfully, but these errors were encountered: