
AO Network Stability #651

Open
6 of 10 tasks
twilson63 opened this issue Apr 29, 2024 · 0 comments · Fixed by #743
twilson63 commented Apr 29, 2024

The current implementation of the AO Computer is still struggling with stability. Many of the issues relate to its dependence on the Arweave GraphQL gateway and its optimistic cache. Here are the major pain point areas:

  1. Spawning new processes - GraphQL timeouts or 404s occur when creating a new process and trying to locate the appropriate SU. If the process id does not appear in the GraphQL results quickly, the network is unable to discover the location of the process.

  2. Checkpoints - AO checkpoints are used to pull the most recent state, so when a process is reset it does not have to rebuild from the beginning. In the current implementation, a restarting CU writes checkpoints on exit and reads them on start (among other edge cases that trigger a checkpoint). However, when the CU must query the gateway for checkpoints during a restart, the gateway often times out or returns 404s, so the CU cannot load the checkpoint memory into its cache and has to re-evaluate from an earlier checkpoint or, worse, from a cold start.

  3. Cron - The cron monitor is unreliable and not persistent. It should be made persistent and tolerant of errors.

  4. MU Issues

  5. SU Issues

  6. CU Issues
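For the process-discovery failures in item 1, one mitigation is to retry the gateway lookup with exponential backoff instead of failing on the first timeout or 404, giving the optimistic cache time to index the new process. A minimal sketch; `queryFn`, the retry counts, and the delays are assumptions for illustration, not the actual ao implementation:

```javascript
// Retry a flaky async lookup (e.g. a GraphQL query locating a process's SU)
// with exponential backoff. Returns the first non-null result, or throws the
// last error once all retries are exhausted.
async function retryWithBackoff (queryFn, { retries = 5, baseMs = 200 } = {}) {
  let lastErr
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const result = await queryFn()
      if (result != null) return result // e.g. the SU location was found
    } catch (err) {
      lastErr = err // timeout or 404: fall through to backoff and retry
    }
    // exponential backoff: baseMs, 2*baseMs, 4*baseMs, ...
    await new Promise(resolve => setTimeout(resolve, baseMs * 2 ** attempt))
  }
  throw lastErr ?? new Error('process not found after retries')
}
```

A caller would wrap its gateway query in `queryFn` and treat exhausted retries as "process not yet discoverable" rather than a hard failure.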

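The checkpoint problem in item 2 suggests an ordered fallback for a restarting CU: prefer a locally persisted checkpoint (no network), fall back to the gateway, and only cold-start when both fail. A sketch under assumed interfaces; `localStore` and `gateway` are hypothetical, not the actual CU code:

```javascript
// Resolve the best available checkpoint for a process on CU restart.
// Returns { source, checkpoint } where source records which tier succeeded.
async function loadLatestCheckpoint (processId, localStore, gateway) {
  // 1. Local checkpoint written on the previous shutdown, no network needed.
  const local = await localStore.get(processId)
  if (local) return { source: 'local', checkpoint: local }

  // 2. Gateway query, tolerating the timeouts/404s seen during restarts.
  try {
    const remote = await gateway.queryCheckpoint(processId)
    if (remote) return { source: 'gateway', checkpoint: remote }
  } catch (_err) {
    // gateway timeout or 404: fall through to cold start
  }

  // 3. Nothing found: the caller must re-evaluate from scratch.
  return { source: 'coldstart', checkpoint: null }
}
```

Persisting checkpoints locally on exit means a flaky gateway during restart degrades to a slower path rather than a cold start.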
Affected Use Cases

  • New AOS processes fail
  • Downtime for core processes (cred, trinity, etc.) when they get into this state, because it takes them so long to evaluate from zero.

One last note concerns 429s: when the CUs or MUs make too many requests they get rate-limited, which could also be contributing to the instability.
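One way to make the units better citizens after a 429 is to honor a Retry-After header when the gateway sends one (per RFC 9110), and otherwise back off exponentially with jitter. A hedged sketch; the parameter defaults are assumptions, not measured values from the ao services:

```javascript
// Compute how long (in ms) to wait before retrying after an HTTP 429.
// Honors a numeric Retry-After header when present; otherwise uses
// exponential backoff with random jitter, capped at maxMs.
function delayAfter429 (headers, attempt, baseMs = 500, maxMs = 30000) {
  const retryAfter = headers['retry-after']
  if (retryAfter != null) {
    const seconds = Number(retryAfter)
    if (!Number.isNaN(seconds)) return Math.min(seconds * 1000, maxMs)
  }
  const backoff = baseMs * 2 ** attempt   // 500, 1000, 2000, ...
  const jitter = Math.random() * baseMs   // spread retries across clients
  return Math.min(backoff + jitter, maxMs)
}
```

The jitter matters here: if every CU/MU retries on the same schedule after a rate-limit window, they will all hit the limit again together.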

@twilson63 twilson63 transferred this issue from permaweb/aos Apr 29, 2024
@jfrain99 jfrain99 self-assigned this May 28, 2024
@jfrain99 jfrain99 linked a pull request May 29, 2024 that will close this issue