Background
The MU has no way to recover from errors when messages fail. However, modifications have been made to set the stage for this change to be possible: all MU messages now run in a worker that processes a queue.
Problem
The worker has no persistence or mechanism to track the lifecycle of messages.
Solution
There are two parts to the solution.
First, add a retry mechanism. When processMsg, processSpawn, or processAssign fails, the message should be put back on the queue and retried at the stage where it failed. This requires a mechanism that tracks the stage of processing so that processing can be picked back up at that stage. One option is a "stage" parameter passed into the business logic and also returned from it, so it can be stored and fed back in on a later attempt. For example, we could call processMsg({stage: 'begin', ...otherParams}); on success it would return {stage: 'complete'}, but if it fails somewhere along the way it would return something like {stage: 'su-send'}, which would be fed back in on the retry. The 'begin' and 'complete' stages could also be implicit and assumed when no stage is present. This is only one option for adding retries; there are certainly others.
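As one possible shape for this, here is a minimal sketch of the stage-based retry loop. The stage names and the processMsg call come from the description above; the job shape, the attempt limit, and the in-memory queue are illustrative assumptions, not existing MU code:

```typescript
// Stages from the description above; a real implementation would have
// one stage per step of the pipeline (e.g. 'su-send').
type Stage = 'begin' | 'su-send' | 'complete';

interface MsgJob {
  id: string;
  stage: Stage;
  attempts: number;
}

// Stand-in for the real business logic. It receives the stage to resume
// from and returns the stage it reached: 'complete' on success, or the
// failing stage on a recoverable error.
async function processMsg(job: MsgJob): Promise<{ stage: Stage }> {
  // ... real stage-by-stage work would happen here ...
  return { stage: 'complete' };
}

const MAX_ATTEMPTS = 5; // assumed retry budget

async function workOnce(queue: MsgJob[]): Promise<void> {
  const job = queue.shift();
  if (!job) return;
  const { stage } = await processMsg(job);
  if (stage !== 'complete' && job.attempts + 1 < MAX_ATTEMPTS) {
    // Re-enqueue at the stage that failed so completed work is not repeated.
    queue.push({ ...job, stage, attempts: job.attempts + 1 });
  }
}
```

Because the stage is both an input and an output, it can be persisted alongside the queued message, which dovetails with the persistence work in the second part.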
The second part of the solution is persisting the queue. There is now one worker per core in the MU, and each worker has a queue. The queue should be made persistent using something like SQLite, so it can be saved across deployments and restarts and messages are not lost. A queueId parameter is passed to the worker when it starts; this should be used to determine which persistent queue a worker initializes and works on when there are multiple cores and therefore multiple queues. One consideration: if the number of cores shrinks across deployments, the leftover persistent queues would have to be merged into the remaining ones.
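A rough sketch of the per-worker persistent queue keyed by queueId follows. To keep the example dependency-free it uses a JSON file as a stand-in for the suggested SQLite store; the class name, file layout, and method names are assumptions:

```typescript
import * as fs from 'node:fs';

// Per-worker persistent queue. queueId comes from the worker's startup
// parameters (see above); the file-backed store here is a stand-in for
// SQLite, which a real implementation would use instead.
class PersistentQueue<T> {
  private items: T[];

  constructor(private readonly queueId: string) {
    // Each worker loads the queue matching its queueId, so pending
    // messages survive restarts and deployments.
    this.items = fs.existsSync(this.path())
      ? JSON.parse(fs.readFileSync(this.path(), 'utf8'))
      : [];
  }

  private path(): string {
    return `queue-${this.queueId}.json`;
  }

  enqueue(item: T): void {
    this.items.push(item);
    this.flush();
  }

  dequeue(): T | undefined {
    const item = this.items.shift();
    this.flush();
    return item;
  }

  get size(): number {
    return this.items.length;
  }

  // Write-through on every mutation keeps the on-disk state current;
  // SQLite would make this both faster and atomic.
  private flush(): void {
    fs.writeFileSync(this.path(), JSON.stringify(this.items));
  }
}
```

For the shrinking-core consideration, a startup step could scan for queue files whose queueId no longer maps to a live worker and drain their items into the surviving queues before processing begins.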
Definition of Done
However it is accomplished, these are the end goals:
If the processing of a message fails partway through and the failure is recoverable, it should be retried until it succeeds or the failure is determined to be unrecoverable.
Message queues should survive restarts and deploys so that messages are not lost when these happen, which is frequently.
The persistence mechanism should be local to the MU; infrastructure will handle persisting across deployments.
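The first goal, retry until success or until the failure is known to be unrecoverable, could look roughly like the helper below. The error class, attempt limit, and backoff policy are illustrative assumptions, not existing MU code:

```typescript
// Hypothetical marker for failures that retrying cannot fix
// (e.g. a malformed message).
class NonRecoverableError extends Error {}

async function retryUntilDone<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up immediately on failures that cannot succeed on retry,
      // and after the retry budget is exhausted.
      if (err instanceof NonRecoverableError || attempt >= maxAttempts) {
        throw err;
      }
      // Exponential backoff between attempts.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```

In the MU this policy would wrap the stage-aware business logic, so each retry resumes from the persisted stage rather than from the beginning.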
Some reported cases of message loss
There are various examples of MU messages failing along the way.
The MU needs a way to track and retry messages.
Users are reporting failures in manually triggered message processes (from the SDK):
https://ao_marton.g8way.io/#/process/K_-yDwh-rzcYTLVYvP1WT2T78OJe7dW_Q2AgGKcE6OA
Some more examples:
A transfer with a Debit-Notice but no Credit-Notice: https://ao_marton.g8way.io/#/message/yq3crRAYAAc2inum1tZMizVwmcRIbNQRcvekAbW6nTE
A Get-Price response that arrived successfully and whose handler execution created a Transfer message, but the Transfer message was never sent: https://ao_marton.g8way.io/#/message/OHEo8Zz2HCVTDZ56NFew1O1nwUQkKcVFjweljXsw1w4
Messages are also reported as being heavily delayed, sometimes by multiple minutes.