fix(mu): MU Message Reliability #661

Open
VinceJuliano opened this issue May 1, 2024 · 0 comments
Assignees
Labels
mu (ao Messenger Unit)

Comments

@VinceJuliano
Collaborator

VinceJuliano commented May 1, 2024

Background

The MU has no way to recover from errors when message processing fails. However, modifications have already been made to set the stage for this change: all MU messages now run through a worker that processes a queue.

Problem

The worker has no persistence and no mechanism for tracking the lifecycle of a message.

Solution

There are two parts to the solution:

  1. Add a retry mechanism. When processMsg, processSpawn, or processAssign fails for some reason, the message should be put back on the queue and retried at the particular stage where it failed. This requires a mechanism that tracks the stage of processing so the work can be picked back up at that stage. One option is a "stage" parameter that is passed into the business logic and also returned from it, so the returned stage can be stored and fed back in on a later retry. For example, we could call processMsg({ stage: 'begin', ...otherParams }); on success the return value would include { stage: 'complete' }, but if it failed partway through it would return something like { stage: 'su-send' }, which would be fed back in on the retry. The 'begin' and 'complete' stages could also be implicit and assumed when no stage is present. This is only one option for adding retries, and possibly not the best; there are definitely others. (A rough sketch of this idea appears after this list.)

  2. Persist the queue. The MU now runs one worker per core, and each worker has its own queue. The queue should be made persistent using something like SQLite, so that it can be saved across deployments and restarts and messages are not lost. A queueId parameter is passed to each worker when it starts; it should be used to determine which persistent queue the worker initializes and works on when there are multiple cores and therefore multiple queues. One consideration: if the number of cores shrinks across deployments, the leftover persistent queues would have to be combined. (A sketch of such a queue follows the retry sketch below.)
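
As a rough illustration of the stage idea in (1), here is a minimal sketch. The stage names ('fetch-schedule', 'su-send', 'push-result'), the handlers map, and processWithRetry are assumptions invented for this sketch, not the MU's actual code or stages.

```js
// Illustrative only: real MU stage names, signatures, and error handling will differ.
const STAGES = ['fetch-schedule', 'su-send', 'push-result']

// Placeholder step implementations standing in for the pieces of processMsg.
const handlers = {
  'fetch-schedule': async (ctx) => ({ ...ctx, schedule: 'stub' }),
  'su-send': async (ctx) => ({ ...ctx, suResponse: 'stub' }),
  'push-result': async (ctx) => ({ ...ctx, pushed: true })
}

// Run the stages in order starting at the given stage; on failure, return the
// stage that failed so the caller can requeue the task and resume there later.
async function processMsg ({ stage = STAGES[0], ...ctx }) {
  if (stage === 'complete') return { ...ctx, stage }
  for (let i = Math.max(STAGES.indexOf(stage), 0); i < STAGES.length; i++) {
    try {
      ctx = await handlers[STAGES[i]](ctx)
    } catch (err) {
      return { ...ctx, stage: STAGES[i], error: String(err) }
    }
  }
  return { ...ctx, stage: 'complete' }
}

// Retry loop: feed the returned stage back in until the message completes.
async function processWithRetry (task, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    task = await processMsg(task)
    if (task.stage === 'complete') return task
    await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100)) // backoff
  }
  throw new Error(`gave up at stage "${task.stage}"`)
}
```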

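For (2), here is a minimal sketch of a per-worker persistent queue, assuming better-sqlite3 as the "something like sqlite" and an invented schema; the real table layout and the way queueId is wired in would still need to be decided.

```js
// Sketch of a per-worker persistent queue, not the MU's actual implementation.
// Assumes better-sqlite3 is installed; the schema and class API are illustrative.
import Database from 'better-sqlite3'

export class PersistentQueue {
  constructor (queueId, file = 'mu-queues.db') {
    this.queueId = queueId
    this.db = new Database(file)
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS queue (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        queue_id TEXT NOT NULL,
        stage    TEXT NOT NULL DEFAULT 'begin',
        task     TEXT NOT NULL
      )
    `)
  }

  // Add a task, remembering the stage to resume from on the next attempt.
  enqueue (task, stage = 'begin') {
    this.db
      .prepare('INSERT INTO queue (queue_id, stage, task) VALUES (?, ?, ?)')
      .run(this.queueId, stage, JSON.stringify(task))
  }

  // Pop the oldest task belonging to this worker's queueId, or undefined.
  dequeue () {
    const row = this.db
      .prepare('SELECT * FROM queue WHERE queue_id = ? ORDER BY id LIMIT 1')
      .get(this.queueId)
    if (!row) return undefined
    this.db.prepare('DELETE FROM queue WHERE id = ?').run(row.id)
    return { ...JSON.parse(row.task), stage: row.stage }
  }
}
```

A real version would likely mark rows as in-flight instead of deleting them before processing succeeds, and would need a merge step for leftover queue_ids if the core count shrinks across deployments.
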
Definition of Done

However it is accomplished, these are the end goals:

  1. If the processing of a message fails somewhere along the way and the failure is recoverable, it should be retried until it succeeds or the failure is determined to be unrecoverable.
  2. Message queues should survive restarts and deploys, which happen frequently, so that messages are not lost when they do.
  3. The persistence mechanism should be local to the MU; infrastructure will handle persisting it across deployments.

Some reported cases of message loss

There are various examples of MU messages failing along the way.

The MU needs a way to track and retry messages.

Users are reporting failures in manually triggered message processing (from the SDK):

https://ao_marton.g8way.io/#/process/K_-yDwh-rzcYTLVYvP1WT2T78OJe7dW_Q2AgGKcE6OA

Some more examples:
A transfer with a Debit-Notice but no Credit-Notice: https://ao_marton.g8way.io/#/message/yq3crRAYAAc2inum1tZMizVwmcRIbNQRcvekAbW6nTE

A Get-Price response that arrived successfully and whose handler execution created a Transfer message, but the Transfer message never went out: https://ao_marton.g8way.io/#/message/OHEo8Zz2HCVTDZ56NFew1O1nwUQkKcVFjweljXsw1w4

Messages are also reported as being heavily delayed, sometimes by multiple minutes.
