fix(mu): MU Message Reliability #661

Open
VinceJuliano opened this issue May 1, 2024 · 0 comments
Assignees
Labels
mu (ao Messenger Unit)

Comments

@VinceJuliano
Collaborator

VinceJuliano commented May 1, 2024

Background

The MU has no way to recover from errors when message processing fails. However, modifications have already been made to set the stage for this change: all MU messages now run through a worker that processes a queue.

Problem

The worker has no persistence and no mechanism for tracking the lifecycle of a message.

Solution

There are two parts to the solution:

  1. Add a retry mechanism. When processMsg, processSpawn, or processAssign fails for some reason, the message should be put back on the queue and retried at the particular stage where it failed. This requires a mechanism that tracks the stage of processing so the work can be picked back up at that stage. One option is a "stage" parameter that is passed into the business logic and also returned from it, so the returned stage can be stored and fed back in on a later retry. For example, we could call processMsg({ stage: 'begin', ...otherParams }); on success the return value would include { stage: 'complete' }, but if it failed partway through it would return something like { stage: 'su-send' }, which would be fed back in on the retry. The 'begin' and 'complete' stages could also be implicit and assumed when no stage is present. This is only one option for adding retries, and possibly not the best; there are definitely others. (A rough sketch of this idea appears after this list.)

  2. Persist the queue. The MU now runs one worker per core, and each worker has its own queue. The queue should be made persistent using something like SQLite, so that it can be saved across deployments and restarts and messages are not lost. A queueId parameter is passed to each worker when it starts; it should be used to determine which persistent queue the worker initializes and works on when there are multiple cores and therefore multiple queues. One consideration: if the number of cores shrinks across deployments, the leftover persistent queues would have to be combined. (A sketch of such a queue follows the retry sketch below.)
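
As a rough illustration of the stage idea in (1), here is a minimal sketch. The stage names ('fetch-schedule', 'su-send', 'push-result'), the handlers map, and processWithRetry are assumptions invented for this sketch, not the MU's actual code or stages.

```js
// Illustrative only: real MU stage names, signatures, and error handling will differ.
const STAGES = ['fetch-schedule', 'su-send', 'push-result']

// Placeholder step implementations standing in for the pieces of processMsg.
const handlers = {
  'fetch-schedule': async (ctx) => ({ ...ctx, schedule: 'stub' }),
  'su-send': async (ctx) => ({ ...ctx, suResponse: 'stub' }),
  'push-result': async (ctx) => ({ ...ctx, pushed: true })
}

// Run the stages in order starting at the given stage; on failure, return the
// stage that failed so the caller can requeue the task and resume there later.
async function processMsg ({ stage = STAGES[0], ...ctx }) {
  if (stage === 'complete') return { ...ctx, stage }
  for (let i = Math.max(STAGES.indexOf(stage), 0); i < STAGES.length; i++) {
    try {
      ctx = await handlers[STAGES[i]](ctx)
    } catch (err) {
      return { ...ctx, stage: STAGES[i], error: String(err) }
    }
  }
  return { ...ctx, stage: 'complete' }
}

// Retry loop: feed the returned stage back in until the message completes.
async function processWithRetry (task, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    task = await processMsg(task)
    if (task.stage === 'complete') return task
    await new Promise((resolve) => setTimeout(resolve, 2 ** attempt * 100)) // backoff
  }
  throw new Error(`gave up at stage "${task.stage}"`)
}
```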

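For (2), here is a minimal sketch of a per-worker persistent queue, assuming better-sqlite3 as the "something like sqlite" and an invented schema; the real table layout and the way queueId is wired in would still need to be decided.

```js
// Sketch of a per-worker persistent queue, not the MU's actual implementation.
// Assumes better-sqlite3 is installed; the schema and class API are illustrative.
import Database from 'better-sqlite3'

export class PersistentQueue {
  constructor (queueId, file = 'mu-queues.db') {
    this.queueId = queueId
    this.db = new Database(file)
    this.db.exec(`
      CREATE TABLE IF NOT EXISTS queue (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        queue_id TEXT NOT NULL,
        stage    TEXT NOT NULL DEFAULT 'begin',
        task     TEXT NOT NULL
      )
    `)
  }

  // Add a task, remembering the stage to resume from on the next attempt.
  enqueue (task, stage = 'begin') {
    this.db
      .prepare('INSERT INTO queue (queue_id, stage, task) VALUES (?, ?, ?)')
      .run(this.queueId, stage, JSON.stringify(task))
  }

  // Pop the oldest task belonging to this worker's queueId, or undefined.
  dequeue () {
    const row = this.db
      .prepare('SELECT * FROM queue WHERE queue_id = ? ORDER BY id LIMIT 1')
      .get(this.queueId)
    if (!row) return undefined
    this.db.prepare('DELETE FROM queue WHERE id = ?').run(row.id)
    return { ...JSON.parse(row.task), stage: row.stage }
  }
}
```

A real version would likely mark rows as in-flight instead of deleting them before processing succeeds, and would need a merge step for leftover queue_ids if the core count shrinks across deployments.
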
Definition of Done

However it is accomplished, these are the end goals:

  1. If the processing of a message fails somewhere along the way and the failure is recoverable, it should be retried until it succeeds or the failure is determined to be unrecoverable.
  2. Message queues should survive restarts and deploys, which happen frequently, so that messages are not lost when they do.
  3. The persistence mechanism should be local to the MU; infrastructure will handle persisting it across deployments.

Some reported cases of message loss

There are various examples of MU messages failing along the way.

The MU needs a way to track and retry messages.

Users are reporting failures in manually triggered message processing (from the SDK):

https://ao_marton.g8way.io/#/process/K_-yDwh-rzcYTLVYvP1WT2T78OJe7dW_Q2AgGKcE6OA

Some more examples:
A transfer with a Debit-Notice but no Credit-Notice: https://ao_marton.g8way.io/#/message/yq3crRAYAAc2inum1tZMizVwmcRIbNQRcvekAbW6nTE

A Get-Price response that arrived successfully and whose handler execution created a Transfer message, but the Transfer message never went out: https://ao_marton.g8way.io/#/message/OHEo8Zz2HCVTDZ56NFew1O1nwUQkKcVFjweljXsw1w4

Messages are also reported as being heavily delayed, sometimes by multiple minutes.
